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SUMMARY 


A new set of benchmarks has been developed for the performance evaluation of highly parallel 
supercomputers. These benchmarks consist of a set of kernels, the “Parallel Kernels,” and a simulated 
application benchmark. Together they mimic the computation and data movement characteristics of 
large scale computational fluid dynamics (CFD) applications. 

The principal distinguishing feature of these benchmarks is their “pencil and paper” specification — 
all details of these benchmarks are specified only algorithmically. In this way many of the difficulties 
associated with conventional benchmarking approaches on highly parallel systems are avoided. 




1 GENERAL REMARKS 


D. Bailey, D. Browning,* R. Carter, S. Fineberg,* and H. Simon* 


1.1 Introduction 

The Numerical Aerodynamic Simulation (NAS) Program, which is based at NASA Ames Research 
Center, is a large scale effort to advance the state of computational aerodynamics. Specifically, the 
NAS organization aims “to provide the Nation’s aerospace research and development community by the 
year 2000 a high-performance, operational computing system capable of simulating an entire aerospace 
vehicle system within a computing time of one to several hours” (ref. 1, p. 3). The successful solution 
of this “grand challenge” problem will require the development of computer systems that can perform 
the required complex scientific computations at a sustained rate nearly one thousand times greater 
than current generation supercomputers can now achieve. The architecture of computer systems able 
to achieve this level of performance will likely be dissimilar to the shared memory multiprocessing 
supercomputers of today. While no consensus yet exists on what the design will be, it is likely that the 
system will consist of at least 1000 processors computing in parallel. 

Highly parallel systems with computing power roughly equivalent to traditional shared memory 
multiprocessors exist today. Unfortunately, the performance evaluation of these systems on comparable 
types of scientific computations is very difficult for several reasons. Few relevant data are available 
for the performance of algorithms of interest to the computational aerophysics community on many 
currently available parallel systems. Benchmarking and performance evaluation of such systems has not 
kept pace with advances in hardware, software and algorithms. In particular, there is as yet no generally 
accepted benchmark program or even a benchmark strategy for these systems. 

The popular “kernel” benchmarks that have been used for traditional vector supercomputers, such 
as the Livermore Loops (ref. 2), the LINPACK benchmark (refs. 3 and 4) and the original NAS Kernels 
(ref. 5), are clearly inappropriate for the performance evaluation of highly parallel machines. First of 
all, the tuning restrictions of these benchmarks rule out many widely used parallel extensions. More 
importantly, the computation and memory requirements of these programs do not do justice to the vastly 
increased capabilities of the new parallel machines, particularly those systems that will be available by 
the mid-1990s. 

On the other hand, a full scale scientific application is similarly unsuitable. First of all, porting 
a large program to a new parallel computer architecture requires a major effort, and it is usually hard 
to justify a major research task simply to obtain a benchmark number. For that reason we believe that 
the otherwise very successful PERFECT Club benchmark (ref. 6) is not suitable for highly parallel 
systems. This is demonstrated by very sparse performance results for parallel machines in the recent 
reports (refs. 7, 8, and 9). 


•Computer Sciences Corporation, El Segundo, California. This work is supported through NASA Contract NAS 2-12961. 
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Alternatively, an application benchmark could assume the availability of automatic software tools 
for transforming “dusty deck” source into efficient parallel code on a variety of systems. However, 
such tools do not exist today, and many scientists doubt that they will ever exist across a wide range of 
architectures. 

Some other considerations for the development of a meaningful benchmark for a highly parallel 
supercomputer are the following: 

• Advanced parallel systems frequently require new algorithmic and software approaches, and 
these new methods are often quite different from the conventional methods implemented in source code 
for a sequential or vector machine. 

• Benchmarks must be “generic” and should not favor any particular parallel architecture. This 
requirement precludes the usage of any architecture-specific code, such as message passing code. 

• The correctness of results and performance figures must be easily verifiable. This requirement 
implies th at both input and output data sets must be kept very small. It also implies that the nature of 
the computation and the expected results must be specified in great detail. 

• The memory size and run time requirements must be easily adjustable to accommodate new 
systems with increased power. 

• The benchmark must be readily distributable. 

In our view, the only benchmarking approach that satisfies all of these constraints is a “paper and 
pencil” benchmark. The idea is to specify a set of problems only algorithmically. Even the input data 
must be specified only on paper. Naturally, the problem has to be specified in sufficient detail that a 
uni que solution exists, and the required output has to be brief yet detailed enough to certify that the 
problem has been solved correctly. The person or persons implementing the benchmarks on a given 
system are expected to solve the various problems in the most appropriate way for the specific system. 
The choice of data structures, algorithms, processor allocation and memory usage are all (to the extent 
allowed by the specification) left open to the discretion of the implementer. Some extension of Fortran 
or C is required, and reasonable limits are placed on the usage of assembly code and the like, but 
otherwise programmers are free to utilize language constructs that give the best performance possible 
on the particular system being studied. 

To this end, we have devised a number of relatively simple “kernels,” which are specified com- 
pletely in chapter 2 of this document. However, kernels alone are insufficient to completely assess the 
performance potential of a parallel machine on real scientific applications. The chief difficulty is that a 
certain data structure may be very efficient on a certain system for one of the isolated kernels, and yet 
this data structure would be inappropriate if incorporated into a larger application. In other words, the 
performance of a real computational fluid dynamics (CFD) application on a parallel system is critically 
dependent on data motion between computational kernels. Thus we consider the complete reproduction 
of this data movement to be of critical importance in a benchmark. 

Our benchmark set therefore consists of two major components: five parallel kernel benchmarks 
and three simulated application benchmarks. The simulated application benchmarks combine several 
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computations in a manner that resembles the actual order of execution in certain important CFD appli- 
cation codes. This is discussed in more detail in chapter 3. 

We feel that this benchmark set successfully addresses many of the problems associated with 
benchmarking parallel machines. Although we do not claim that this set is typical of all scientific 
computing, it is based on the key components of several large aeroscience applications used on su- 
percomputers by scientists at NASA Ames Research Center. These benchmarks will be used by the 
Numerical Aerodynamic Simulation (NAS) Program to evaluate the performance of parallel computers. 

1.2 Benchmark Rules 


1.2.1 Definitions 

In the following, the term “processor” is defined as a hardware unit capable of executing both 
floating point addition and floating point multiplication instructions. The “local memory” of a pro- 
cessor refers to randomly accessible memory that can be accessed by that processor in less than one 
microsecond. The term “main memory” refers to the combined local memory of all processors. This 
includes any memory shared by all processors that can be accessed by each processor in less than one 
microsecond. The term “mass storage” refers to non-volatile randomly accessible storage media that 
can be accessed by at least one processor within forty milliseconds. A “processing node” is defined 
as a hardware unit consisting of one or more processors plus their local memory, which is logically a 
single unit on the network that connects the processors. 

The term “computational nodes” refers to those processing nodes primarily devoted to high-speed 
floating point computation. The term “service nodes” refers to those processing nodes primarily devoted 
to system operations, including compilation, linking and communication with external computers over 
a network. 

122 General rules 

Implementations of these benchmarks must be based on either Fortran-77 or C, although a wide 
variety of parallel extensions are allowed. This requirement stems from the observation that Fortran and 
C are the most commonly used programming languages by the scientific parallel computing community 
at the present time. If in the future, other languages gain wide acceptance in this community, they will be 
considered for inclusion in this group. Assembly language and other low-level languages and constructs 
may not be used, except that certain specific vendor-supported assembly-coded library routines may be 
called (see section 1.2.3). 

We are of the opinion that such language restrictions are necessary, because otherwise considerable 
effort would be made by benchmarkers in low-level or assembly-level coding. Then the benchmark 
results would tend to reflect the amount of programming resources available to the benchmarking orga- 
nization, rather than the fundamental merits of the parallel system. Certainly the mainstream scientists 
that these parallel computers are intended to serve will be coding applications at the source level, almost 
certainly in Fortran or C, and thus these benchmarks are designed to measure the performance that can 
be expected from such code. 
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Accordingly, the following rules must be observed in any implementations of the NAS Parallel 
Benchmarks: 

• All floating point operations must be performed using 64-bit floating point arithmetic. 

• All benchm arks must be coded in either Fortran-77 (ref. 10) or C (ref. 11), with certain approved 
extensions. 

• Implementation of the benchmarks may not include a mix of Fortran-77 and C code — one or 
the other must be used. 

• Any extension of Fortran-77 that is in the Fortran-90 draft dated June 1990 or later (ref. 12) is 
allowed. 

• Any extension of Fortran-77 that is in the Parallel Computer Fortran (PCF) draft dated March 
1990 or later (ref. 13) is allowed. 

• Any extension of Fortran-77 that is in the High Performance Fortran (HPF) draft dated January 
1992 or later (ref. 14) is allowed. 

• Any language extension or library routine that is employed in any of the benchmarks must be 
supported by the vendor and available to all users. 

• Subprog rams and library routines not written in Fortran or C may only perform certain functions, 
as indicated in the next section. 

• All rules apply equally to subroutine calls, language extensions and compiler directives (i.e., 
special comments). 

UJ Allowable Fortran extensions and library routines 

Fortran extensions and library routines are also permitted that perform the following: 

• Indicate sections of code that can be executed in parallel or loops that can be distributed among 
different computational nodes. 

• Specify the allocation and organization of data among or within computational nodes. 

• Communicate data between processing nodes. 

• Communicate data between the computational nodes and service nodes. 

• Rearrange data stored in multiple computational nodes, including constructs to perform indirect 
addressing and array transpositions. 

• Synchronize the action of different computational nodes. 

• Initialize for a data communication or synchronization operation that will be performed or 
completed later. 
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• Perform high-speed input or output operations between main memory and the mass storage 
system. 

• Perform any of the following array reduction operations on an array either residing within a 
single computational node or distributed among multiple nodes: +, x, max, min, and, or, xor. 

• Combine communication between nodes with one of the operations listed in the previous item. 

• Perform any of the following computational operations on arrays either residing within a single 
computational node or distributed among multiple nodes: dense or sparse matrix-matrix multiplication, 
dense or sparse matrix-vector multiplication, one-dimensional, two-dimensional or three-dimensional 
fast Fourier transforms, sorting, block tri-diagonal system solution and block penta-diagonal system 
solution. Such routines must be callable with general array dimensions. 


13 Sample Codes 

The intent of this paper is to completely specify the computation to be carried out. Theoretically, a 
complete implementation, including the generation of the correct input data, could be produced from the 
information in this paper. However, the developers of these benchmarks are aware of the difficulty and 
time required to generate a correct implementation from scratch in this manner. Furthermore, despite 
several reviews, ambiguities in this technical paper may exist that could delay implementations. 

In order to reduce the number of difficulties and to aid the benchmarking specialist, Fortran-77 
computer programs implementing the benchmarks are available. These codes are to be considered 
examples of how the problems could be solved on a single processor system, rather than statements of 
how they should be solved on an advanced parallel system. The sample codes actually solve scaled 
down versions of the benchmarks that can be run on many current generation workstations. Instructions 
are supplied in comments in the source code on how to scale up the program parameters to the full size 
benchmark specifications. 

These programs, as well as the benchmark document itself, are available from the Systems Devel- 
opment Branch in the NAS Systems Division. Mail Stop 258-5, NASA Ames Research Center, Moffett 
Field, CA 94035-1000, attn: NAS Parallel Benchmark Codes. The sample codes are provided on 
Macintosh floppy disks and contain the Fortran source codes, “ReadMe” files, input data files, and ref- 
erence output data files for correct implementations of the benchmark problems. These codes have been 
validated on a number of computer systems ranging from conventional workstations to supercomputers. 

Three classes of problems are defined in this document. These will be denoted “Sample Code,” 
“Class A,” and “Class B,” since the three classes differ mainly in the sizes of principal arrays. Tables 1.1, 
1.2, and 1.3 give the problem sizes, floating point operation counts, memory requirements (measured in 
Mw), run time and performance rates (measured in MFLOPS) for each of the eight benchmarks and for 
the Sample Code, Class A, and Class B problem sets. These statistics are based on implementations on 
one processor of a Cray Y-MP. The operation count for the Integer Sort benchmark is based on integer 
operations rather than floating-point operations. The entries in the “Problem Size” columns are sizes of 
key problem parameters. Complete descriptions of these parameters are given in chapters 2 and 3. 
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Table 1.1. NAS Parallel Benchmarks Sample Code Statistics 


Benchmark code 

Problem 

size 

Memory 

(Mw) 

Time 

(sec) 

MFLOPS 

Embarrassingly parallel (EP) 

2 21 

0.1 

11.6 

120 

Multigrid (MG) 

32 s 

0.1 

0.1 

128 

Conjugate gradient (CG) 

1400 

1.0 

1.2 

63 

3-D FFT PDE (FT) 

64 3 

2.0 

1.2 

160 

Integer sort (IS) 

2 16 

03 

0.2 

30.5 

LU solver (LU) 

12 3 

0.3 

3.5 

28 

Pentadiagonal solver (SP) 

12 3 

0.2 

7.2 

24 

Block tridiagonal solver (BT) 

12 3 

03 

7.2 

34 


Table 1.2. NAS Parallel Benchmarks Class A Statistics 


Benchmark code 

Problem 

size 

Memory 

(Mw) 

Time 

(sec) 

MFLOPS 

Embarrassingly parallel (EP) 

~ 2 s8 

1 

151 

147 

Multigrid (MG) 

256 3 

57 

54 

154 

Conjugate gradient (CG) 

14000 

10 

22 

70 

3-D FFT PDE (FT) 

256 2 x 128 

59 

39 

192 

Integer sort (IS) 

2 23 

26 

21 

37.2 

LU solver (LU) 

64 3 

8 

344 

189 

Pentadiagonal solver (SP) 

64 3 

6 

806 

175 

Block tridiagonal solver (BT) 

64 3 

6 

923 

192 


Table 1.3. NAS Parallel Benchmarks Class B Statistics 


Benchmark code 

Problem 

size 

Memory 

(Mw) 

Time 

(sec) 

MFLOPS 

Embarrassingly parallel (EP) 

2 3U 

18 

512 

197 

Multigrid (MG) 

256 3 

59 

114 

165 

Conjugate gradient (CG) 

75000 

97 

998 

55 

3-D FFT PDE (FT) 

512x256 x 256 

162 

366 

195 

Integer sort (IS) 

2 25 

114 

126 

25 

LU solver (LU) 

102 3 

121 

1973 

162 

Pentadiagonal solver (SP) 

102 3 

36 

2160 

207 

Block tridiagonal solver (BT) 

102 3 

160 

3554 

203 
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1.4 Submission of Benchmark Results 


It must be emphasized that the sample codes described in section 1.3 are not the benchmark codes, 
but only implementation aids. For the actual benchmarks, the sample codes must be scaled to larger 
problem sizes. The sizes of the current benchmarks were chosen so that implementations are possible 
on currently available supercomputers. As parallel computer technology progresses, future releases of 
these benchmarks will specify larger problem sizes. 

The authors and developers of these benchmarks encourage submission of performance results for 
the problems listed in table 1.2. Periodic publication of the submitted results is planned. Benchmark 
results should be submitted to the Applied Research Branch, NAS Systems Division, Mail Stop T045-1, 
NASA Ames Research Center, Moffett Field, CA 94035, attn: NAS Parallel Benchmark Results. A 
complete submission of results should include the following: 

• A detailed description of the hardware and software configuration used for the benchmark runs. 

• A description of the implementation and algorithmic techniques used. 

• Source listings of the benchmark codes. 

• Output listings from the benchmarks. 
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2 THE KERNEL BENCHMARKS 


D. Bailey, E. Barszcz, L. Dagum,* P. Frederickson,^ R. Schreiber,^ and H. Simon* 


2.1 Overview 

After an evaluation of a number of large scale CFD and computational aerosciences applications 
on the NAS supercomputers at NASA Ames, a number of kernels were selected for the benchmark. 
These were supplemented by some other kernels which are intended to test specific features of parallel 
machines. The following benchmark set was then assembled: 

EP: An “embarrassingly parallel” kernel. It provides an estimate of the upper achievable limits 
for floating point performance, i.e., the performance without significant interprocessor communication. 

MG: A simplified multigrid kernel. It requires highly structured long distance communication and 
tests both short and long distance data communication. 

CG: A conjugate gradient method is used to compute an approximation to the smallest eigenvalue 
of a large, sparse, symmetric positive definite matrix. This kernel is typical of unstructured grid com- 
putations in that it tests irregular long distance communication, employing unstructured matrix vector 
multiplication. 

FT: A 3-D partial differential equation solution using FFTs. This kernel performs the essence of 
many “spectral” codes. It is a rigorous test of long-distance communication performance. 

IS: A large integer sort. This kernel performs a sorting operation that is important in “particle 
method” codes. It tests both integer computation speed and communication performance. 


These kernels involve substantially larger computations than previous kernel benchmarks, such as 
the Livermore Loops or Linpack, and therefore they are more appropriate for the evaluation of parallel 
machines. The Parallel Kernels in particular are sufficiently simple that they can be implemented 
on a new system without unreasonable effort and delay. Most importantly, as emphasized earlier, 
this set of benchmarks incorporates a new concept in performance evaluation, namely that only the 
computational task is specified, and that the actual implementation of the kernel can be tailored to the 
specific architecture of the parallel machine. 

In this chapter the Parallel Kernel benchmarks are presented, and the particular rules for allowable 
changes are discussed. Future reports will describe implementations and benchmarking results on a 
number of parallel supercomputers. 

‘Computer Sciences Corporation. This work is supported through NASA Contract NAS 2-12961. 
t Research Institute for Advanced Computer Science (RIACS), Ames Research Center. This work is supported by NAS 
Systems Division through Cooperative Agreement Number NCC 2-387. 
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22 Description of the Kernels 


22.1 Kernel EP: An embarrassingly parallel benchmark 
D. Bailey and P. Frederickson 
Brief Statement of Problem 

Generate pairs of Gaussian random deviates according to a specific scheme described below and 
tabulate the number of pairs in successive square annuli. 

Detailed Description 

Set n = 2 28 , a = 5 13 and s = 271, 828, 183. Generate the pseudorandom floating point values rj 
in the interval (0, 1) for 1 < j < 2n using the scheme described in section 2.3. Then for 1 < j < n set 
Xj = 2r2j_i - 1 and yj = 2r2j - 1. Thus Xj and yj are uniformly distributed on the interval (-1, 1). 

Next set k = 0. Then beginning with j = 1, test to see if tj = x? + < 1. If not, reject this pair 

and proceed to the next j. If this inequality holds, then set k «— k + 1, X k = xj yj (—2 log tj)/tj and 
Yk = yj ^(— 2 log Tj ) /tj, where log denotes the natural logarithm. Then X * and Y* are independent 
flanssian deviates with mean zero and variance one. Approximately mr / 4 pairs will be constructed in 
this manner. See reference 2, page 117 for additional discussion of this scheme for generating Gaussian 
deviates. 

Finally, for 0 < l < 9 tabulate Q t as the count of the pairs (X k ,Y k ) that lie in the square annulus 
l < maxflXfcl, |Yfc|) < l + 1, and output the ten Qi counts. 

This completes the definition of the Class A problem. The Class B problem is the same except 
that n = 2 30 . 

Verification Test 

Each of the ten Qi counts must agree exactly with reference values. For this value of n, the 
reference counts are as follows: 
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Class A 

Class B 

l 

Qi 

Qi 

0 

98257395 

393058470 

1 

93827014 

375280898 

2 

17611549 

70460742 

3 

1110028 

4438852 

4 

26536 

105691 

5 

245 

948 

6 

0 

5 

7 

0 

0 

8 

0 

0 

9 

0 

0 

Operations to be Timed 



All of the operations described above are to be timed, including tabulation and output. 

Other Features 

• This problem is typical of many Monte Carlo simulation applications. 

• The only requirement for communication is the combination of the 10 sums from various 
processors at the end. 

• Separate sections of the uniform pseudorandom numbers can be independently computed on 
separate processors. See section 2.3 for details. 

• The smallest distance between a floating-point value and a nearby integer among the rj, X k and 
Y k values is 3.2 x 10 -11 , which is well above the achievable accuracy using 64 bit floating arithmetic 
on existing computer systems. Thus if a truncation discrepancy occurs, it implies a problem with the 
system hardware or software. 

2.2.2 Kernel MG: a simple 3D multigrid benchmark 
EL Barszcz and P. Frederickson 
Brief Statement of Problem 

Four iterations of the V-cycle multigrid algorithm described below are used to obtain an approximate 
solution u to the discrete Poisson problem 


V 2 ti = v 

on a 256 x 256 x 256 grid with periodic boundary conditions. 
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Detailed Description 

Set v = 0 except at the twenty points listed in table 2.1. where t; = ±1. (These points were 
determined as the locations of the ten largest and ten smallest pseudorandom numbers generated as in 
Kernel FT.) 


Table 2.1. Nonzero values for v 


v u.k 

<MJ0 

- 1.0 

211,154, 98 

102,138,112 

101,156, 59 

17,205, 32 

92, 63,205 


199, 7,203 

250,170,157 

82,184,255 

154,162, 36 

223, 42,240 

+ 1.0 

57,120,167 

5,118,175 

176,246,164 

45,194,234 

212, 7,248 


115,123,207 

202, 83,209 

203, 18,198 

243,172, 14 

54,209, 40 


Begin the iterative solution with u = 0. Each of the four iterations consists of the following two 
steps, in which k = 8 = log2(256): 

r — v — A u ( evaluate residual ) 

u = u + M k r (apply correction) 

Here M * denotes the V-cycle multigrid operator, defined in table 2.2. In this definition A denotes the 


Table 2.2. V-cycle multigrid operator 


z k 

M k r k 

: 


if k > 1 

nt-i 

= Pr k 

(restrict residual) 


Zk - 1 

= M^n.i 

(recursive solve) 


z k 

= Q z k _i 

(prolongate) 


rk 

= r k - A z k 

(evaluate residual) 


z k 

= z k + Sr k 

(apply smoother) 

else 

Z 1 

= S rj. 

(apply smoother) 


trilinear finite element discretization of the Laplacian normalized as indicated in table 2.3, where 
the coefficients of P, Q , and S are also listed. 

In this table co denotes the central coefficient of the 27-point operator, when these coefficients are 
arranged as a 3 x 3 x 3 cube. Thus co is the coefficient that multiplies the value at the gridpoint (ij,k), 
while c\ multiplies the six values at grid points which differ by one in exactly one index, C 2 multiplies 
the next closest twelve values, those that differ by one in exactly two indices, and C 3 multiplies the eight 
values located at grid points that differ by one in all three indices. The restriction operator P given in 
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Table 2.3. Coefficients for trilinear finite element discretization 


c 

Co 

Cl 

C2 

c 3 

A 

— 8.0/3.0 

0.0 

1.0/6.0 

1.0/12.0 

P 

1.0/2.0 

1. 0/4.0 

1.0/8.0 

1.0/16.0 

Q 

1.0 

1.0/2.0 

1.0/4.0 

1.0/8.0 

S(a) 

— 3.0/8.0 

+1.0/32.0 

-1.0/64.0 

0.0 

S(b) 

-3.0/17.0 

+1.0/33.0 

-1.0/61.0 

0.0 


this table is the trilinear projection operator of finite element theory, normalized so that the coefficients of 
all operators are independent of level, and is half the transpose of the trilinear interpolation operator Q. 

Verification Test 

Class A: Evaluate the residual after four iterations of the V-cyde multigrid algorithm using the 
coefficients from the S(a) row for the smoothing operator S, and verify that its L 2 norm 

ll r 1 I 2 = [ ( E r ij* )/ 2563 ) 1/2 


agrees with the reference value 

0.2433365309 x 10 -05 

within an absolute tolerance of 10 -14 . 

Class B: The array size is the same as for Class A (256), but 20 iterations must be performed using 
the coefficients from the S(b) row for the smoothing operator S. The output L 2 norms must agree with 
the reference value 

0.180056440132 x 10 -05 

within an absolute tolerance of 10~ 14 . 

Timing 

Start the clock before evaluating the residual for the first time, and after initializing u and v. Stop 
the clock after evaluating the norm of the final residual, but before displaying or printing its value. 

2.23 Kernel CG: Solving an unstructured sparse linear system by the conjugate gradient method 

R. Schreiber, H. Simon, and R. Carter 

Brief Statement of Problem 

This benchmark uses the inverse power method to find an estimate of the largest eigenvalue of a 
symmetric positive definite sparse matrix with a random pattern of nonzeros. 
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Detailed Description 

In the following, A denotes the sparse matrix of order n, lower case Roman letters are vectors, Xj 
is the j th component of the vector x, and superscript “T” indicates the usual transpose operator. Lower 
case Greek letters are scalars. We denote by ||x|| the Euclidean norm of a vector x, ||x|| = Vx^x. All 
quantities are real. The inverse power method is to be implemented as follows: 


x = [1,1,..., l] r ; 

(start timing here) 

DO if = 1, niter 

Solve the system Az = x and return ||r||, as described below 
C = A + l/(x T z) 

Print if, ||r||, and C 
x = z/\\z\ I 

ENDDO 

(stop timing here) 


Values for the size of the system n, number of outer iterations niter, and the shift A for three 
different problem sizes are provided in table 2.4. The solution z to the linear system of equations Az - x 
is to be approximated using the conjugate gradient (CG) method. This method is to be implemented as 
follows: 


r = x 

T 

p = r x r 
V — r 

DO i = 1, 25 
q = Ap 

a = p/ij^q) 

z = z + ap 

PQ = P 
r = r — aq 
T 

p — r x r 

0 = P/PO 

p = r + /3p 

ENDDO 

compute residual norm explicitly: ||r|| = ||x — Az\\ 
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Table 2.4. Input parameters for CG benchmark 


Size 

n 

niter 

NONZER 

A 

Sample 

1400 

25 

7 

10 

Class A 

14000 

25 

11 

20 

Class B 

75000 

75 

13 

60 


Verification Test 

The program should print, at every outer iteration of the power method, the iteration number it, 
the eigenvalue estimate £, and the Euclidean norm ||r|| of the residual vector at the last CG iteration 
(the vector r in the discussion of CG above). For each size problem the computer value of £ must 
agree with the reference value Gref within a tolerance of 1.0 x 10” 10 , i.e., |C~CrefI < 1*0 x 10 -10 . 
These reference values Gref are provided in table 2.5. 


Table 2.5. Output parameters for CG benchmark 


Size 

Computed 

nonzeros 

FLOP 

xlO 9 

mem 

(MW) 

Gref 

Sample 

78148 

0.111 

1.0 

8.59717750786234 

Class A 

1853104 

2.50 

10.0 

17.13023505380784 

Class B 

13708072 

54.9 

96.7 

22.712745482078 


Tuning 

The reported time must be the wall-clock time required to compute all niter iterations and print 
the results, after the matrix is generated and downloaded into the system, and after the initialization of 
the starting vector x. 

It is permissible initially to reorganize the sparse matrix data structure ( arow , acol , aelt), which 
is produced by the matrix generation routine (described below), to a data structure better suited for the 
target machine. The original or the reorganized sparse matrix data structure can then be subsequently 
used in the conjugate gradient interation. Time spent in the initial reorganization of the data structure 
will not be counted towards the benchmark time. 

It is also permissible to use several different data structures for the matrix A, keep multiple copies 
of the matrix A, or to write A to mass storage and read it back in. However, the time for any data 
movements, which take place within the power iterations (outer iteration) or within the conjugate gradient 
iterations (inner iteration), must be included in the reported time. 
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Other Features 


The input sparse matrix A is generated by a Fortran 77 subroutine called makea, which is provided 
on the sample code disk described in section 1.3. In this program, the random number generator is 
initialized with a = 5 13 and s = 314159265. Then the subroutine makea is called to generate the 
matrix A. This program may not be changed. In the makea subroutine the matrix A is represented by 
the following Fortran 77 variables: 

n (integer) — the number of rows and columns 

nz (integer) — the number of nonzeros 

a (real *8 ) — array of NZ nonzeros 

ia (integer) — array of NZ row indices. Element A(K) is in row IA(K) for all 1 < K < NZ. 

ja (integer) — array of N+l pointers to the beginnings of columns. Column J of the matrix is stored 
in positions JA(J) through JA(J+1)-1 of A and IA. JA(N+1) contains NZ+1. 

The code generates the matrix as the weighted sum of N outer products of random sparse vectors 

x: 

N 

A = U{xx^ 
i = 1 

where the weights a;,- are a geometric sequence with = 1 and the ratio chosen so that = 0.1. 
The vectors x are chosen to have a few randomly placed nonzeros, each of which is a sample from 
the uniform distribution on (0, 1). Furthermore, the i tfl element of X{ is set to 1/2 to insure that A 
cannot be structurally singular. Finally, 0.1 is added to the diagonal of A. This results in a matrix 
whose condition number (the ratio of its largest eigenvalue to its smallest) is roughly 10. The number 
of randomly chosen elements of x is provided for each problem size in table 2.4, in the “NONZER” 
column. The final number of nonzeros of A are listed in table 2.5 in the “computed nonzeros” column. 
As implemented in the sample codes, the shift A of the main diagonal of A is the final task in subroutine 
makea. Values are provided for A in table 2.4. 

The data structures used are these. First, a list of triples (arow, acol, aelt) is constructed. Each of 
these represents an element in row i = arow, column j — acol, with value a,j = aelt. When the arow 
and acol entries of two of these triples coincide, then the values in their aelt fields are added together 
in creating a^j. The process of assembling the matrix data structures from the list of triples, including 
the process of adding coincident entries, is done by the subroutine sparse, which is called by makea 
and is also provided. For examples and more details on this sparse data structure, consult section 2.7 
of the book by Duff, Erisman, and Reid (ref. 3). 
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2.2.4 Kernel FT: A 3-D fast-Fourier transform partial differential equation benchmark 
D. Bailey and P. Frederickson 
Brief Statement of Problem 

Numerically solve a certain partial differential equation (PDE) using forward and inverse FFTs. 
Detailed Description 
Consider the PDE 

^M=«vW) 

where x is a position in three-dimensional space. When a Fourier transform is applied to each side, this 
equation becomes 

where v(z,t) is the Fourier transform of u(x, t). This has the solution 

v(z,t) = e- 4 ** 2 lA/(z,0) 


Now consider the discrete version of the original PDE. Following the above steps, it can be solved 
by computing the forward 3-D discrete Fourier transform (DFT) of the original state array u(x, 0), 
multiplying the results by certain exponentials, and then performing an inverse 3-D DFT. The forward 
DFT and inverse DFT of the ni x x 713 array it are defined respectively as 


7*3 — 1 Tl2 — 1 1*1— 1 

JW«) =E Z E Uj>ii ,e-2"W»l c -^r/n 2e -2«l./"3 

1=0 k= 0 j= 0 

. r»3— lri2— lnj — 1 

*£» = EEE w^ ,/ni ^ i ‘ r/ns ' 2,riWn3 

n l n 2 n 3 i = Q k=0 j—Q 


The specific problem to be solved for the Class A benchmark is as follows. Set ni = 256, n 2 = 
256, and 713 = 128. Generate 2711712713 64-bit pseudorandom floating point values using the pseudoran- 
dom number generator in section 2.3, starting with the initial seed 314159265. Then fill the complex 
"ray Uj,k,h 0 < j < n i, 0 < A; < 712 , 0 < l < 713 , with this data, where the first dimension varies 
most rapidly as in the ordering of a 3-D Fortran array. A single complex number entry of U consists of 
two consecutive pseudorandomly generated results. Compute the forward 3-D DFT of U, using a 3-D 
fast Fourier transform (FFT) routine, and call the result F. Set a = 10 ~ 6 and set t = 1. Then compute 

WjJkJ = e- iar2 ° 2+P+P> %>:,l 
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where j is defined as j for 0 < j < n\/2 and j — nj for n\f 2 < j < n\. The indices k and l 
are similarly defined with ri 2 and 713 . Then compute an inverse 3-D DFT on W , using a 3-D FFT 
routine, and call the result the array X. Finally, compute the complex checksum Xq,r,a where 

q = j (mod ni), r = 3j (mod n 2 ) and s = 5 j (mod n 3 ). After the checksum for this t has 
been output, increment t by one. Then repeat the above process, from the computation of W through 
the incrementing of t, until the step t = N has been completed. In this benchmark, N = 6. The V 
array and the array of exponential terms for t = 1 need only be computed once. Note that the array of 
exponential terms for t > 1 can be obtained as the t-th power of the array for t = 1. 

This completes the definition of the Class A problem. The Class B problem is the same except 
that ni = 512, n 2 = 256, n 3 = 256, and N = 20. 

Any algorithm may be used for the computation of the 3-D FFTs mentioned above. One algorithm 
is the following. Assume that the data in the input nj x n 2 x it 3 complex array A are organized so that 
for each k and l, all elements of the complex vector (Aj^, 0< j < wj) are contained within a single 
processing node. First perform an ni-point 1-D FFT on each of these n 2 n 3 complex vectors. Then 
transpose the resulting array into an n 2 x n 3 x nj complex array B. Next, perform an n 2 -point 1-D 
FFT on each of the n 3 ni first-dimension complex vectors of B. Again note that each of the 1-D FFTs 
can be performed locally within a single node. Then transpose the resulting array into an n 3 x n\ x n 2 
complex array C. Finally, perform an n 3 -point 1-D FFT on each of the nin 2 first-dimension complex 
vectors of C. Then transpose the resulting array into an ni x n 2 x n 3 complex array D. This array D 
is the final 3-D FFT result. 

Algorithms for performing an individual 1-D complex-to-complex FFT are well known and will 
not be presented here. Readers are referred to references 4, 5, 6, 7, and 8 for details. It should be noted 
that some of these FFTs are “unordered” FFTs, i.e., the results are not in the correct order but instead 
are scrambled by a bit-reversal permutation. Such FFTs may be employed if desired, but it should be 
noted that in this case the ordering of the exponential factors in the definition of above must 

be sim ilarly scrambled in order to obtain the correct results. Also, the final result array X may be 
scrambled, in which case the checksum calculation will have to be changed accordingly. 

It should be noted that individual 1-D FFTs, array transpositions, and even entire 3-D FFT operations 
may be performed using vendor-supplied library routines. See sections 1.2.2 and 1.2.3 for details. 


Operations to be Timed 

All of the above operations, including the checksum calculations, must be timed. 

Verification Test 

The N complex checksums must agree with reference values to within one part in 10 12 . For the 
parameter sizes specified above, the reference values are as follows: 
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Class A Class B 


t 

Real Part 

Imaginary Part 

t 

Real Part 

Imaginary Part 

1 

504.6735008193 

511.4047905510 

1 

517.7643571579 

507.7803458597 

2 

505.9412319734 

509.8809666433 

2 

515.4521291263 

508.8249431599 

3 

506.9376896287 

509.8144042213 

3 

514.6409228649 

509.6208912659 

4 

507.7892868474 

510.1336130759 

4 

514.2378756213 

510.1023387619 

5 

508.5233095391 

510.4914655194 

5 

513.9626667737 

510.3976610617 

6 

509.1487099959 

510.7917842803 

6 

513.7423460082 

510.5948019802 




7 

513.5547056878 

510.7404165783 




8 

513.3910925466 

510.8576573661 




9 

513.2470705390 

510.9577278523 




10 

513.1197729984 

511.0460304483 




11 

513.0070319283 

511.1252433800 




12 

512.9070537032 

511.1968077718 




13 

512.8182883502 

511.2616233064 




14 

512.7393733383 

511.3203605551 




15 

512.6691062020 

511.3735928093 




16 

512.6064276004 

511.4218460548 




17 

512.5504076570 

511.4656139760 




18 

512.5002331720 

511.5053595966 




19 

512.4551951846 

511.5415130407 




20 

512.4146770029 

511.5744692211 


Other Features 


• 3-D FFTs are a key part of certain CFD applications, notably large eddy turbulence simulations. 

• The 3-D FFT steps require considerable communication for operations such as array 
transpositions. 


2.2.5 Kernel IS: Parallel sort over small integers 
L. Dagum 

Brief Statement of Problem 

Sort N keys in parallel. The keys are generated by the sequential key generation algorithm given 
below and initially must be uniformly distributed in memory. The initial distribution of the keys can 
have a great impact on the performance of this benchmark, and the required distribution is discussed in 
detail below. 

Definitions 

A sequence of keys, {Ki \ i = 0, 1, . . . , N — 1}, will be said to be sorted if it is arranged in 
non-descending order, i.e., if, < K{ + \ < K i+ 2 . . .. The rank of a particular key in a sequence is 
the index value i that the key would have if the sequence of keys were sorted. Ranking , then, is the 
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process of arriving at a rank for all the keys in a sequence. Sorting is the process of permuting the the 
keys in a sequence to produce a sorted sequence. If an initially unsorted sequence, Ko, K \,. . . , ifjv-l 
has ranks r(0),r(l), .. . ,r(7V - 1), the sequence becomes sorted when it is rearranged in the order 
K r ( o), if r (i), • • • , # r (JV_l). Sorting is said t0 b® stable ke y s retain their original relative order. 

In other words, a sort is stable only if r(i) < r(j ) whenever K r (i) = K r(J ) 80(1 * < J* Stable sorting 
is not required for this benchmark. 

Memory Mapping 

The benchmark requires ranking an unsorted sequence of N keys. The initial sequence of keys 
will be generated in an unambiguous sequential manner as described below. This sequence must be 
mapped into the memory of the parallel processor in one of the following ways depending on the type 
of memory system. In all cases, one key will map to one word of memory. Word size must be no less 
than 32 bits. Once the keys are loaded onto the memory system, they are not to be moved or modified 
except as required by the procedure described in the Procedure subsection. 

Shared Global Memory All N keys initially must be stored in a contiguous address space. If A, is 
used to denote the address of the i tfl word of memory, then the address space must be [A*, A*+jy_i]. 
The sequence of keys, Kq, K\, . . . , -Kjv-l* initially must map to this address space as 

A i+j <— MEM{Kj) for j = 0,1,..., AT- 1 (2.1) 

where MEM(Kj) refers to the address of Kj. 

Distributed Memory In a distributed memory system with p distinct memory units, each memory unit 
initially must store N p keys in a contiguous address space, where 

N p = N/p (2.2) 


If Ai is used to denote the address of the i th word in a memory unit, and if Pj is used to denote 
the memory unit, then Pj Pi A{ will denote the address of the word in the memory unit. 
Some initial addressing (or “ordering”) of memory units must be assumed and adhered to throughout 
the benchmark. Note that the addressing of the memory units is left completely arbitrary. If N is not 
evenly divisible by p, then memory units {Pj \ j = 0, 1, . . . ,p— 2} will store N p keys, and memory 
unit P p -i will store Npp keys, where now 

N p = [N/p + 0.5J (2.3) 

Npp = N - (p— 1 )N P (2.4) 


In some cases (in particular if p is large) this mapping may result in a poor initial load balance with 
Npp » N p . In such cases it may be desirable to use p’ memory units to store the keys, where p 1 < p. 
This is allowed, but the storage of the keys still must follow either equation 2.2 or equations 2.3-2.4 
with p' replacing p. In the following we will assume N is evenly divisible by p. The address space in an 
individual memory unit must be [Aj, -l]- memory units are individually hierarchical, then N p 
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keys must be stored in a contiguous address space belonging to a single memory hierarchy and A{ then 
denotes the address of the i th word in that hierarchy. The keys cannot be distributed among different 
memory hierarchies until after timing begins. The sequence of keys, Ko, K \, . . . , initially must 

map to this distributed memory as 

P k (~l j4j _| .j < — M EM (K k N p +j) J = Oj 1 j • • ■ ? Np 1 

and k = 0, 1, . . . ,p — 1 (2 S) 


where MEM(K k ^ p+ j ) refers to the address of K kN + j. If N is not evenly divisible by p, then the 
mapping given above must be modified for the case wnere k = p — 1 as 


Pp- 1 H A i+ j < — MEM {K(p-\ )N p +j) f° r j = 0, 1, ... , Npp - 1. (2.6) 


Hierarchical Memory All N keys initially must be stored in an address space belonging to a single 
memory hierarchy which will here be referred to as the main memory. Note that any memory in the 
hierarchy which can store all N keys may be used for the initial storage of the keys, and the use of 
the term “main memory” in the description of this benchmark should not be confused with the more 
general definition of this term in section 1.2.1. The keys cannot be distributed among different memory 
hierarchies until after timing begins. The mapping of the keys to the main memory must follow one of 
either the shared global memory or the distributed memory mappings described above. 

The benchmark requires computing the rank of each key in the sequence. The mappings described 
above define the initial ordering of the keys. For shared global and hierarchical memory systems, the 
same mapping must be applied to determine the correct ranking. For the case of a distributed memory 
system, it is permissible for the mapping of keys to memory at the end of the ranking to differ from the 
initial mapping only in the following manner: the number of keys mapped to a memory unit at the end 
of the ranking may differ from the initial value, Np. It is expected, in a distributed memory machine, 
that good load balancing of the problem will require changing the initial mapping of the keys and for 
this reason a different mapping may be used at the end of the ranking. If N Pk is the number of keys 
in memory unit P k at the end of the ranking, then the mapping which must be used to determine the 
correct ranking is given by 


P k n A i+ j < — MEM(r(kN Pk + j)) for j — 0, 1, ... , N Pk - 1 

and fc = 0, 1, . . . , p — 1 (2.7) 

where r(kN Pk + j) refers to the rank of key K k ^ pk+ j. Note, however, this does not imply that the 
keys, once loaded into memory, may be moved. Copies of the keys may be made and moved, but the 
original sequence must remain intact such that each time the ranking process is repeated (Step 4 of 
Procedure) the original sequence of keys exists (except for the two modifications of Step 4a) and the 
same algorithm for ranking is applied. Specifically, knowledge obtainable from the communications 
pattern carried out in the first ranking cannot be used to speed up subsequent rankings and each iteration 
of Step 4 should be completely independent of the previous iteration. 
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Key Generation Algorithm 


The algorithm for generating the keys makes use of the pseudorandom number generator described 
in section 2.3. The keys will be in the range [0, Bmax )• Let ry be a random fraction uniformly 
distributed in the range [0, 1], and let K{ be the i th key. The value of Ki is determined as 


K{ < [5n»ox( r 4i+0 + r 4*+l + r 4i+2 4" r 4i+3)/^J ^ or * — 0,1,..., AT 1. (2.8) 

Note that K{ must be an integer and [ J indicates truncation. Four consecutive pseudorandom numbers 
from the pseudorandom number generator must be used for generating each key. All operations before 
the truncation must be performed in 64-bit double precision. The random number generator must be 
initialized with s = 314159265 as a starting seed. 

Partial Verification Test 

Partial verification is conducted for each ranking performed. Partial verification consists of com- 
paring a particular subset of ranks with the reference values. The subset of ranks and the reference 
values are given in table 2.6. Note that the subset of ranks is selected to be invariant to the ranking 
algorithm (recall that stability is not required in the benchmark). This is accomplished by selecting for 
verification only the ranks of unique keys. If a key is unique in the sequence (i.e., there is no other 
equal key), then it will have a unique rank despite an unstable ranking algorithm. The memory mapping 
described in the Memory Mapping subsection must be applied. 

Full Verification Test 

Full verification is conducted after the last ranking is performed. Full verification requires the 
following: 

1. Rearrange the sequence of keys, {ATj | i = 0, 1,...,JV — 1}, in the order {Kj \ j = 
r(0), r(l), . . . , r(N - 1)}, where r(0), r(l), . . . , r(N — 1) is the last computed sequence of ranks. 

2. For every K{ from i = 0 ... N — 2 test that K{ < 


If the result of this test is true, then the keys are in sorted order. The memory mapping described in the 
Memory Mapping subsection must be applied. 


Table 2.6. Values to be used for partial verification 


Rank (full) 

Full scale 

Rank (sample) 

Sample code 

r (21 12377) 

104 + t 

r (48427) 

0 + 

r(662041) 

17523 + i 

r(17148) 

18 + t 

r(5336171) 

123928 + i 

r (23627) 

346 + t 

r( 3642833) 

8288932 - i 

r (62548) 

64917 - i 

r(4250760) 

8388264 - i 

r(4431) 

65463 - i 
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Procedure 


1. In a scalar sequential manner and using the key generation algorithm described above, generate 
the sequence of N keys. 

2. Using the appropriate memory mapping described above, load the N keys into the memory 
system. 

3. Begin timing. 

4. Do, for i = 1 to Imax 

a. Modify the sequence of keys by making the following two changes: 

Ki <— t (2.9) 

Ki+I m ax * ( Bmax ~ i) ( 2 . 10 ) 

b. Compute the rank of each key. 

c. Perform the partial verification test described above. 

5. End timing. 

6. Perform full verification test described above. 


Specifications 

The specifications given in table 2.7 shall be used in the benchmark. Two sets of values are given, 
one for Class A and one for Class B. 

Table 2.7. Parameter values to be used for benchmark 


Parameter 

Class A 

Class B 

N 


2 Zb 

Bmax 

2 19 

2 21 

seed 

314159265 

314159265 

Imax 

10 

10 


For partial verification, the reference values given in table 2.6 are to be used. In this table, r(j) 
refers to the rank of Kj and i is the iteration of Step 4 of the Procedure. Again two sets of values are 
given, the Full Scale set being for the actual benchmark and the Sample Code set being for development 
purposes. It should be emphasized that the benchmark measures the performance based on use of the 
Full Scale values, and the Sample Code values are given only as a convenience to the implementor. 
Also to be supplied to the implementor is Fortran 77 source code for the sequential implementation of 
the benchmark using the Sample Code values and with partial and full verification tests. 
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23 A Pseudorandom Number Generator for the Parallel NAS Kernels 

D. Bailey 

Suppose that n unifo rm pseudorandom numbers are to be generated. Set a = 5* 3 and let xo = a 
be a specified initial “seed,” i.e., an integer in the range 0 < s < 2 4 ®. Generate the integers x k for 
1 <k<n using the linear congruential recursion 

Xfc+i = ax k (mod 2 46 ) 

and return r k = 2 _46 x fc as the results. Thus 0 < r k < 1, and the r k are very nearly uniformly 
distributed on the unit interval. See reference 2, beginning on page 9 for further discussion of this type 
of pseudorandom number generator. 

Note that any particular value x k of the sequence can be computed directly from the initial seed s by 
luting the binary algorithm for exponentiation, taking remainders modulo 2 46 after each multiplication. 
To be specific, let m be the smallest integer such that 2 m > k, set 6 = s and t = a. Then repeat the 
following for i from 1 to m: 

3 - k/2 

6 <— bt (mod 2 46 ) if 2 j ^ k 
t +- t 2 (mod 2 46 ) 

k^j 

The final value of b is x k = a k s (mod 2 46 ). See reference 2 for further discussion of the binary 
algorithm for exponentiation. 

The operation of multiplying two large integers modulo 2 46 can be implemented using 64 bit 
floating point arithmetic by splitting the arguments into two words with 23 bits each. To be specific, 
suppose one wishes to compute c = ab (mod 2 4 ®). Then perform the following steps, where int 
denotes the greatest integer: 

ai «- int (2 -23 a) 
a 2 *— a — 2 23 ai 
bi <- int (2 -23 6) 
b2 «- b - 2 23 £>i 
tl ♦— aii>2 + a 2^l 
t 2 *- int (2 _23 ti) 

H - h ~ 2 23 *2 
t 4 *- 2 23 t 3 + a 2 i>2 
int ( 2 -46 t 4 ) 
c ♦ — t 4 — 2 46 t 5 
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An implementation of the complete pseudorandom number generator algorithm using this scheme 
produces the same sequence of results on any system that satisfies the following requirements: 

• The input multiplier a and the initial seed s, as well as the constants 2 23 , 2 -23 , 2 46 and 2 -46 , 
can be represented exactly as 64 bit floating point constants. 

• The truncation of a nonnegative 64 bit floating point value less than 2 24 is exact. 

• The addition, subtraction and multiplication of 64 bit floating point values, where the arguments 
and results are nonnegative whole numbers less than 2 47 , produce exact results. 

• The multiplication of a 64 bit floating point value, which is a nonnegative whole number less 
than 2 47 , by the 64 bit floating point value 2 -m , 0 < m < 46, produces an exact result. 


These requirements are met by virtually all scientific computers in use today. Any system based on the 
IEEE-754 floating point arithmetic standard 1 easily meets these requirements using double precision. 
However, it should be noted that obtaining an exact power of two constant on some systems requires a 
loop rather than merely an assignment statement with **. 

Other Features 

• The period of this pseudorandom number generator is 2 44 = 1.76 x 10* 3 , and it passes all 
reasonable statistical tests. 

• This calculation can be vectorized on vector computers by generating results in batches of size 
equal to the hardware vector length. 

• By using the scheme described above for computing x * directly, the starting seed of a partic- 
ular segment of the sequence can be quickly and independently determined. Thus numerous separate 
segments can be generated on separate processors of a multiprocessor system. 

• Once the IEEE-754 floating point arithmetic standard gains universal acceptance among scien- 

tific computers, the radix 2 46 can be safely increased to 2 52 , although the scheme described above for 
multiplying two such numbers must be correspondingly changed. This will increase the period of the 
pseudorandom sequence by a factor of 64 to approximately 1.13 x 10 15 . 
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3 A METHODOLOGY FOR BENCHMARKING SOME CFD KERNELS ON 

HIGHLY PARALLEL PROCESSORS 


Sisira Weeratunga,* Eric Barszcz, Rod Fatoohi,* and V. Venkatakrishnan* 


Summary 

A collection of iterative PDE solvers embedded in a pseudo application program is proposed for the 
performance evaluation of CFD codes on highly parallel processors. The pseudo application program 
is stripped of complexities associated with real CFD application programs, thereby enabling a simpler 
description of the algorithms. However, it is capable of reproducing the essential computation and 
data motion characteristics of large scale, state of the art CFD codes. In this chapter, we present a 
detailed description of the pseudo application program concept. Preceding chapters address our basic 
approach towards the performance evaluation of parallel supercomputers targeted for use in numerical 
aerodynamic simulation. 


3.1 Introduction 

Computational Fluid Dynamics (CFD) is one of the fields in the area of scientific computing that 
has driven the development of modem vector supercomputers. Availability of these high performance 
computers has led to impressive advances in the state of the art of CFD, both in terms of the physical 
complexity of the simulated problems and the development of computational algorithms capable of 
extracting high levels of sustained performance. However, to carry out the computational simulations 
essential for future aerospace research, CFD must be able and ready to exploit potential performance 
and cost/performance gains possible through the use of highly parallel processing technologies. Use of 
parallel supercomputers appears to be one of the most promising avenues for realizing large complex 
physical simulations within realistic time and cost constraints. Although many of the current CFD 
application programs are amenable to a high degree of parallel computation, performance data on such 
codes for the current generation of parallel computers often has been less than remarkable. This is 
especially true for the class of CFD algorithms involving global data dependencies, commonly referred 
to as the implicit methods. Often the bottleneck is data motion, due to high latencies and inadequate 
bandwidth. 

It is a common practice among computer hardware designers to use the dense linear equation 
solution subroutine in the UNPACK to represent the scientific computing workload. Unfortunately, 
the computational structures in most CFD algorithms bear little resemblance to this UNPACK routine, 
both in terms of its parallelization strategy as well as floating point and memory reference features. 
Most CFD application codes are characterized by their use of either regular or irregular sparse data 
structures and associated algorithms. One of the reasons for this situation is the near absence of com- 
munication between computer scientists engaged in the design of high performance parallel computers 
and the computational scientists involved in the development of CFD applications. To be effective, 
such exchange of information should occur during the early stages of the design process. It appears 
that one of the contributing factors to this lack of communication is the complexity and confidentiality 

* Computer Sciences Corp., Ames Research Center. 
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associated with the state-of-the-art CFD application codes. One way to help the design process is to 
provide the computer scientists with synthetic CFD application programs, which lack the complexity of 
a real application, but at the same time retain all the essential computational structures. Such synthetic 
application codes can be accompanied by detailed and simpler descriptions of the algorithms involved. 
In return, the performance data on such synthetic application codes can be used to evaluate different 
parallel supercomputer systems at the procurement stage by the CFD community. 

Computational fluid dynamics involves the numerical solution of a system of nonlinear partial 
differential equations in two or three spatial dimensions, with or without time dependence. The gov- 
erning partial differential equations, referred to as the Navier-Stokes equations, represent the laws of 
conservation of mass, momentum and energy applied to a fluid medium in motion. These equations, 
when supplemented by appropriate boundary and initial conditions, describe a particular physical prob- 
lem. To obtain a system of equations amenable to solution on a computer requires the discretization of 
the differential equations through the use of finite difference, finite volume, finite element or spectral 
methods. The inherent nonlinearities of the governing equations necessitate the use of iterative solution 
techniques. Over the past years, a variety of efficient numerical algorithms have been developed, all 
requiring many floating point operations and large amounts of computer memory to achieve a solution 
with the desired level of accuracy. 

In current CFD applications, there are two types of computational meshes used for the spatial 
discretization process: structured and unstructured. Structured meshes are characterized by a consistent, 
logical ordering of mesh points, whose connectivity is associated with a rectilinear coordinate system. 
Computationally, structured meshes give rise to regularly strided memory reference characteristics. In 
contrast, unstructured meshes offer greater freedom in terms of mesh point distribution, but require the 
generation and storage of random connectivity information. Computationally, this results in indirect 
memory addressing with random strides, with its attendant increase in memory bandwidth requirements. 
The synthetic application codes currently under consideration are restricted to the case of structured 
meshes. 

The numerical solution algorithms used in CFD codes can be broadly categorized as either explicit 
or implicit, based on the procedure used for the time domain integration. Among the advantages of 
the explicit schemes are the high degree of easily exploitable parallelism and the localized spatial 
data dependencies. These properties have resulted in highly efficient implementations of explicit CFD 
algorithms on a variety of current generation highly parallel processors. However, the explicit schemes 
suffer from stringent numerical stability bounds and as a result are not optimal for problems that 
require fine mesh spacing for numerical resolution. In contrast, implicit schemes have less stringent 
stability bounds and are suitable for problems involving highly stretched meshes. However, their 
parallel implementation is more difficult and involve local as well as global spatial data dependencies. 
In addition, some of the implicit algorithms possess limited degrees of exploitable parallelism. At 
present, we restrict our synthetic applications to three different representative implicit schemes found in 
a wide spectrum of production CFD codes in use at NASA Ames Research center. 

In the remaining sections of this chapter, we describe the development of a collection of synthetic 
application programs. First we discuss the rationale behind this approach followed by a complete 
description of three such synthetic applications. We also outline the problem setup along with the 
associated verification tests, when they are used to benchmark highly parallel systems. 
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32 Rationale 


In the past, vector supercomputer performance was evaluated through the use of suites of kernels 
chosen to characterize generic computational structures present at a site’s workload. For example, 
NAS Kernels (ref. 1) were selected to characterize the computational workloads inherent in a majority 
of algorithms used by the CFD community at Ames Research Center. However, for highly parallel 
computer systems, this approach is inadequate for the reasons outlined below. 

The first stage of the pseudo application development process was the analysis of a variety of 
implicit CFD codes and the identification of a set of generic computational structures that represented a 
range of computational tasks embedded in them. As a result, the following computational kernels were 
selected: 

1. Solution of multiple, independent systems of nondiagonally-dominant, block tridiagonal equa- 
tions with a (5 x 5) block size. 

2. Solution of multiple, independent systems of nondiagonally-dominant, scalar pentadiagonal 
equations. 

3. Regular-sparse, block (5 x 5) matrix-vector multiplication. 

4. Regular-sparse, block (5x5) lower and upper triangular system solution. 

These kernels constitute a majority of the computationally-intensive, main building blocks of the 
CFD programs designed for the numerical solution of three-dimensional (3D), Euler/Navier-Stokes 
equations using finite-volume/finite-difference discretization on structured grids. Kernels (1) and (2) are 
representative of the computations associated with the implicit operator in versions of the ARC3D code 
(ref. 2). These kernels involve global data dependencies. Although they are similar in many respects, 
there is a fundamental difference with regard to the communication-to-computation ratio. Kernel (3) 
typifies the computation of the explicit part of almost all CFD algorithms for structured grids. Here 
all data dependencies are local, with either nearest neighbor or at most next-to-nearest neighbor type 
dependencies. Kernel (4) represents the computations associated with the implicit operator of a newer 
class of implicit CFD algorithms, typified by the code INS3D-LU (ref. 3). This kernel may contain 
only a limited degree of parallelism, relative to the other kernels. 

In terms of their parallel implementation, these kernels represent varying characteristics with regard 
to the following aspects, which are often related: 

1. Available degree of parallelism. 

2. Level of parallelism and granularity. 

3. Data space partitioning strategies. 

4. Global vs. local data dependencies. 

5. Inter-processor and in-processor data motion requirements. 

6. Ratio of communication to computation. 
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Previous research efforts in adapting algorithms in a variety of flow solvers to the current gen- 
eration of highly parallel processors have indicated that the overall performance of many CFD codes 
is critically dependent on the latency and bandwidth of both the in-processor and inter-processor data 
motion. Therefore, it is important for the integrity of the benchmarking process to faithfully reproduce 
a majority of the data motions encountered during the execution of applications in which these kernels 
are embedded. Also, the nature and amount of data motion is dependent on the kernel algorithms along 
with the associated data structures and the interaction of these kernels among themselves as well as with 
the remainder of the application that is outside their scope. 

To obtain realistic performance data, specification of both the incoming and outgoing data structures 
of the kernels should mimic those occuring in an application program. The incoming data structure is 
dependent on the section of the code where the data is generated, not on the kernel. The optimum data 
structure for the kernel may turn out to be suboptimal for the code segments where the data is generated 
and vice versa. Similar considerations also apply to the outgoing data structure. Allowing the freedom 
to choose optimal incoming and outgoing data structures for the kernel as a basis for evaluating its 
performance is liable to produce results that are not applicable to a complete application code. The 
overall performance should reflect the cost of data motion that occur between kernels. 

In order to reproduce most of the data motions encountered in the execution of these kernels in a 
typical CFD application, we propose embedding them in a pseudo application code. It is designed for 
the numerical solution of a synthetic system of nonlinear Partial Differential Equations (PDEs), using 
iterative techniques similar to those found in CFD applications of interest to NASA Ames Research 
Center. However, it contains none of the pre- and post-processing required by the full CFD applications, 
or the interactions of the processors and the I/O subsystem. This can be regarded as a stripped-down 
version of a CFD application. It retains the basic kernels that are the principal building blocks of 
the application and admits a majority of the interactions required between these basic routines. Also, 
the stripped-down version does not represent a fully configured CFD application in terms of system 
memory requirements. This fact has the potential for creating data partitioning strategies during the 
parallel implementation of the synthetic problem that may be inappropriate for the full application. 

From the point of view of functionality, the stripped-down version does not contain the algorithms 
used to apply boundary conditions as in a real application. It is well known that often the boundary 
algorithms gives rise to load imbalances and idling of processors in highly parallel systems. Due to the 
simplification of the boundary algorithms, it is likely that the overall system performance and efficiency 
data obtained using the stripped-down version may be higher than that of an actual application. This 
effect is somewhat mitigated by the fact that for most realistic problems, a relatively small time amount 
of is spent dealing with boundary algorithms when compared to the time spent in dealing with the 
internal mesh points. Also, most boundary algorithms involve only local data dependencies. 

Some of the other advantages of the stripped-down application vs. full application approach are: 

1. Allows benchmarking where real application codes are confidential. 

2. Easier to manipulate and port from one system to another. 

3. Since only the abstract algorithm is specified, facilitates new implementations that are tied 
closely to the architecture under consideration. 
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4. Allows easy addition of other existing and emerging CFD algorithms to the benchmarking 
process. 

5. Easily scalable to larger problem sizes. 


It should be noted that this synthetic problem differs from a real CFD problem in the following 
important aspects: 

1. In full CFD application codes, a non-orthogonal coordinate transformation (ref. 2) is used to 
map the complex physical domains to the regular computational domains, thereby introducing metric 
coefficients of the transformation into the governing PDE’s and boundary conditions. Such transforma- 
tions are absent in the synthetic problem, and as a result may have reduced arithmetic complexity and 
storage requirements. 

2. A blend of nonlinear, second- and fourth-difference artificial dissipation terms (ref. 4) is used 
in most of the actual CFD codes, whose coefficients are determined based on the local changes in 
pressure. In the stripped-down version, only a linear, fourth difference term is used. This reduces the 
arithmetic and communication complexity needed to compute the added higher-order dissipation terms. 
However, it should be noted that computation of these artificial dissipation terms involve only local data 
dependencies, similar to the matrix-vector multiplication kernel. 

3. In codes where artificial dissipation is not used, upwind differencing based on either flux-vector 
splitting (refs. 5 and 6), flux-difference splitting (ref. 7) or Total Variation Diminishing (TVD) schemes 
(ref. 8) is used. The absence of such differencing schemes in the stripped-down version induces effects 
similar to (2) on the performance data. 

4. Absence of turbulence models. Computation of terms representing some turbulence models 
involve a combination of local and some long-range data dependencies. Arithmetic and communication 
complexity associated with turbulence models are absent. 


In addition, it needs to be emphasized that the stripped-down problem is neither designed nor is 
suitable for the purposes of evaluating the convergence rates and/or the applicability of various iterative 
linear system solvers used in computational fluid dynamics applications. As mentioned before, the 
synthetic problem differs from the real CFD applications in the following important ways: 

1. Absence of realistic boundary algorithms. 

2. Higher than normal dissipative effects. 

3. Lack of upwind differencing effects, based on either flux-vector splitting or TVD schemes. 

4. Absence of geometric stiffness introduced through boundary conforming coordinate transfor- 
mations and highly stretched meshes. 

5. Lack of evolution of weak (i.e., C°— ) solutions found in real CFD applications, during the 
iterative process. 

6. Absence of turbulence modelling effects. 
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Some of these effects tend to suppress the predominantly hyperbolic nature exhibited by the Navier- 
Stokes equations, when applied to compressible flows at high Reynolds numbers. 

3 3 Mathematical Problem Definition 

We consider the numerical solution of the following synthetic system of five nonlinear partial 
differential equations (PDEs): 


d\J 0E(U) ar(U) 0G(U) 

&t ~ d( + a 17 + ac 

ar(u,u ( ) av( u,u„) aw(u,u c ) (3.10) 

a( at/ a c 

+ H(U,U t ,U„U t ), e Dr x D 

with the boundary conditions: 

B(U,U { ,U„U C ) = U b (t, (t, (,</,() eVrxdV (3.16) 

and initial conditions: 

B = U°K,I),C)> («,•?> C )€B for T = 0 (3.1c) 

where D € IR 3 is a bounded domain, dD is its boundary and D T = {0 < r < T}. 


Also, the solution to the system of PDEs: 


/u (1) \ 

u( 2 ) 


U = 


tt(S) 


u( 4 ) 

U (5 >> 


(3.2) 


defined in ( D U dD ) x D r , is a vector function of temporal variable r and spatial variables (£, 77, C) 
that form the orthogonal coordinate system in Ift 3 , i.e.; 


u W =I1 (m)( T , f ,, iC ) 
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The vector functions V B and U° are given and B is the boundary operator. E, F, G, T, V, W and 
H are vector functions with five components each of the form: 


f ^ 

e( 2 ) 


E = 


cW 


e< 4 > 


V e( 5 > / 


(3.3) 


and e( m ) = e( m )(U) etc., are prescribed functions. 

The system given by equation (3.1a) is in the ‘normal-form’, i.e., it gives explicitly the time 
derivatives of all the dependent variables Consequently, the Cauchy data at r = 0, 

given by equation (3.1c) permits the calculation of solution U(r,£,» 7 , £) for r > 0. 

In the current implementation of the synthetic PDE system solver, we seek a steady-state solution 
of equation (3.1) of the form: 


U*=f(£,T7,C) = 


/( 2 ) 

/( 3 ) 
y(4) 

V/wy 

where /( m ) = /( m )(£, 7 ;, £) are prescribed functions of the following form: 
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Here, the vector e is given by 
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and Cm f n, m = 1,2,..., 5 ,n = 1,2, ...,13 are specified constants. The vector forcing function 

H = where h( m ) = /i( m ) (f, rj , £) is chosen such that the system of PDE’s, 

along with the boundary and initial conditions for H, satisfies the prescribed exact solution, U*. This 
implies 




+ 


d£ ' di / ’ 

ar(UM%) av(ir,u;) aw(u*,u*) 

d£ drj dC 


(3.6) 


], for (£, rj, C) € D x D t 


The bounded spatial domain D is specified to be the interior of the unit cube [(0, 1) x (0, 1) x (0, 1)], 
i.e.: 

D = {(*,»?, C) : 0 < £ < 1,0 < r/ < 1,0 < C < 1} 
and its boundary dD, is the surface of the unit cube given by 

dD = {(&»?> C) : f = 0 or l}U{(f,T?,C) : 7 / = 0 or 1 } U { 77 , C) : C = 0 or 1} 

The vector functions E, F, G, T, V and W of the synthetic problem are specified to be the following: 


E = 


where 
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T = 


dfiduM/dt) + (4./3.)k 3 k 4 (d[uW/uM)/dt) 
df\duW/dt) + fc 3 ^4(5[« (3) /« (1) ]/^) 
df\duW/dO + k 3 k 4 (d[uW/uM]/dz) 

«( 5) 


where 
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and fcj, k 2 , h, fc 4) k 5 , d[ m) , dij m \ d[ m) (m = 1, 2, . . . , 5) are given constants. 

33.1 Hie boundary conditions 

The boundary conditions for the system of PDEs is prescribed to be of the uncoupled Dirichlet 
type, and is specified to be compatible with IT, such that 

«("■) = /(”*)({, >,,C) for (r,«,,,C)eD T xaC ( 3 . 7 ) 

and m = 1,2, ... ,5. 


3 3.2 The initial conditions 


The initial values U° in D are set to those obtained by a transfinite, trilinear interpolation (ref. 9) 
of the boundary data given by equation (3.7). Let 


P^ = ( 1.-0 uW(0,r,,Q +£u( m \l, r,,Q 
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3.4 The Numerical Scheme 


Starting from the initial values prescribed by equation (3.9), we seek some discrete approximation 
€ D to the steady-state solution U* of equation (3.1), through the numerical solution of the nonlinear 
system of PDE’s using a pseudo-time marching scheme and a spatial discretization procedure based on 
finite difference approximations. 

3.4.1 Implicit time differencing 

The independent temporal variable r is discretized to produce the set 

D r = {r n :ne [0,N]} 

where the discrete time increment At is given by 

r n = r„_i + At = nAr (4.1) 

Also the discrete approximation of U on D T is denoted by 

U(t) « \f (tiAt) = U" (4.2) 


A generalized single-step temporal differencing scheme for advancing the solution of equation (3.1) 
is given by reference 10 


AU” = 


0 At dAip At air 1 e A¥m _ x 

(1 + 0) dr + (1 + 0) dr + (l+0) Air 


+OK0- 5 


- 0)At 2 + At 3 ] 


(4.3) 


where the forward difference operator A is defined as 

AIT 1 = U n+1 - U n (4.4) 


With the appropriate choice for the parameters 0 and 0, equation (4.3) reproduces many of the 
well known two- and three-level explicit and implicit schemes. In this particular case, we are interested 
only in the two-level, first-order accurate, Euler implicit scheme given by (3 = 1 and 0 = 0, i.e.: 


AU" = At 


dAU” 

dr 


+ At 


air 1 

dr 


+ 0[At 2 ] 


(4.5) 


Substituting for (d AIT 1 /dr) and (dV/dr) in equation (4.5), using equation (3.1a), we get 

AIT 1 -A r a ( AEn + ATn ) aCAF” + AV n ) a(AG n + AW n ) 

T[ de + dr> + ac 1 


+ + ggjA: + + AtH 


(4.6) 


where AE n = E n+1 - E n and E n+1 = E(U n+1 ) etc. 
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Equation (4.6) is nonlinear in A IT 1 as a consequence of the fact that the increments AE”, AF”, 
AG n , AT", AV n and AW” are nonlinear functions of the dependent variables U and its derivatives 
U^, U,, and U^. A linear equation with the same temporal accuracy as equation (4.6) can be obtained 
by a linearization procedure using a local Taylor series expansion in time about U” (refs. 11 and 12); 

E” +1 = E” + (?) n Ar + 0(Ar 2 ) 

dT (4.7) 

= E" + (|5)"(^) n Ar + 0(AT 2 ) 


U” +1 =U” + (^— )”Ar + 0(Ar 2 ) 

OT 


Then, by combining equations (4.7) and (4.8), we get 


E n+1 = E” + (^)”(U” +1 - U”) + 0(Ar 2 ) 
a U 


AE" = A n (U)AlT + 0 (At 2 ) 


(4.10) 


where A(U) is the Jacobian matrix (5E/<9U). 

It should be noted that the above non-iterative time-linearization formulation does not lower the 
formal order of accuracy of temporal discretization in equation (4.6). However, if the steady-state 
solution is the only objective, the quadratic convergence of the Newton-Raphson method (for sufficiently 
good initial approximations) is recovered only as At — * oo. 

Similarly, the linearization of remaining terms gives 
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When the approximations given by equation (4.11) are introduced into equation (4.6), vs 
the following linear equation for AU": 
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It should be noted that the notation of the form 
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is used to represent the expressions as such 


d[(A + M - N^) n AU”] 

ak etc - 

The left hand side (i.e., LHS) of equation (4.13) is referred to as the implicit part and the right 
hand side (i.e., RHS) as the explicit part. 

The solution at the advanced time, r = (n + l)Ar, is given by 

U n+1 =11* + AIT* (4.14) 

The Jacobian matrices for the problem under consideration are given by 
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W ( 5 ) u ( 4 ) 

CM = {*.[“,„] -2 9 }[“ (1) ] 




,* 2 w[» (2 ¥ 4 ^ (3 ¥ + 3 [«< 4 >] 2 , 
C51 = ( 2 )( |„(Dp } 

tt (5) 



(M - N { ) = [0] 
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N = 


d (1) 

®€ 


-[4./3.]fc 3 A:4(u (2) /[« (1) ] 2 ) 


-^^(ttW/M 1 )] 2 ) 


-Ar 3 fc 4 («< 4 )/[ti (1) ] 2 ) 


where 


”51 


0 

d[ 2) + 

[4./3.]fc 3 fc4(l./« (1) ) 


0 


0 


”52 


0 

0 

dfh 

k 3 k4(l./u^) 

0 

”5 3 


0 

0 

0 

4 4 >+ 

fcstafl./ut 1 )) 

”54 


”51 = - [(4./3.)^ 3 fe4 - ^l^ 3 fc4*5]([« (2) ] 2 /[ u(1) ] 3 ) 

- [fc 3 fc 4 - fclfc 3 fc4fc5]([« (3) ] 2 /[« (1) ] 3 ) 

- [fc 3 fc 4 - M 3 M 5 ]([u (4 )] 2 /[u (1) ] 3 ) 

- kiksk^ksiu^ /[uW] 2 ) 

”52 =([4./3.]fc 3 fc 4 - fcifc 3 A: 4 fc5)(« (2 V[« (1) ] 2 ) 

”5 3 =(k 3 k4 - fcifc 3 fc 4 A:5)(u( 3 ty[u^) 2 ) 

”54 =(k 3 k4 - Ajifc^A^Xu^ty^ 1 )] 2 ) 

”55 =<^ + (k\ k 3 kik 3 )(l./ u^) 

(P - Qr,) = [0] 


0 \ 

0 

0 

0 

”55/ 


4 1 ’ 

-A: 3 fc 4 (tt( 2 VM^] 2 ) 


0 

4 2) + 


Q = 


*3*4(1-/”^) 
-(4./3.)fc 3 fc4(”^ 3 V[ u ^^] 2 ) 0 


0 

0 


4 3) + 


-k 3 k 4 (u^ /[u^] 2 ) 


0 


(4./3.)* 3 Ml-/” (1) ) 

0 


0 

0 

0 


4 4) + 

/c 3 fc 4 (l./u( 1 )) 


0 \ 

0 

0 

0 


951 


952 


95 3 


954 


955/ 
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where 


951 = - [*3*4 - *l*3*4*5]([u (2 ¥/l“ (1) ] 3 ) 

- I(4./3.)*S*4 " *l*3*4*5]([u (3) ] 2 /[“ (1) ] 3 ) 

- [*3*4 — *1 *3*4*5] ([« (4 ¥/[« (1) ] 3 ) 

- *1 *3*4*5 (u( 5 V[ M ^) 2 ) 

952 =(*3*4 - *l*3*4*5)(u (2) /[ u(1) ] 2 ) 

953 =([4./3-]*3*4 “ *l*3*4*5)(u* 3 V[ u ^] 2 ) 

954 =(*3*4 - *l*3*4*5)(u (4) /[« (1) ] 2 ) 

955 + (*1*3*4*5)( 1 -/^ 1 ^) 


(R - S C ) = [0] 


S = 


' d d) 

d < 

-*3*4(W (2) /[« {1) ] 2 ) 

-*3*4(^ 3 V( u ^^] 2 ) 

— (4./3.)*3*4(^ 4 VM^l 2 ) 
V S51 


0 

<f+ 

*3*4(1 ./u (1) ) 
0 

0 

«52 


0 

0 

<f+ 

*3* 4 (l./w( 1 ^) 

0 

«53 


0 

0 

0 


4 4 4 


0 \ 

0 

0 

0 


(4/3.)*3*4(l> {1) ) 

«54 s 55 / 


551 = “ [*3*4 “ *1*3*4*5]([“ (2) ] 2 /[ m(1) ] 3 ) 

- [*3*4 - *1 *3*4*5] ([u^f/M 1 *] 3 ) 

- [(4./3.)fc 3 *4 - *1*3*4*5](M 4) ] 2 /[« (1) ] 3 ) 

- *i*3*4*5(tl( 5 VM 1 )] 2 ) 

552 =(*3*4 ~ *1*3*4*5)( u ^ 2 VM 1 ^] 2 ) 

5 53 =(*3*4 “ *1*3*4*5)( u ^ 3 V[^^] 2 ) 

$54 =([4./3.]*3* 4 - *i*3*4* 5 )(u( 4 VM^] 2 ) 

S55 =d^ + (* 1 *3*4*5)(l./w^ 1 ^) 
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3.4.2 Spatial discretization 

The independent spatial variables (*,i7,0 are discretized by covering D, ( the closure of D), with 
a mesh of uniform increments ( h 0 hr,, \ ) in each of the coordinate directions. The mesh points m 
the region will be identified by the index-triple (i,j, k), where the indices t € [1, Ntf, j € [1, N „ J and 
k G [1, JNTJ correspond to the discretization of f, r) and £ coordinates, respectively. 

D h UdD h = {fo,Vj,Ck) :l<i<N^l<j<Nr,,l<k<N c } 

ti = (i- 1 )hf, T)j = U - 1)V Ck = ( k ~ l ) h C ( 4 - 15 ) 

and the mesh widths are given by 

^ = l./(JV € - 1); hr, = l'/(Nr,-lY, h c = l./{N c -l) (4.16) 

with (N^, Nr,,N^) G N being the number of mesh points in f— , 77 — and ^—directions, respectively. 

Then, the set of interior mesh points is given by 

D h = Ot) : 2 < i < (N( ~ 1). 2 <;<(«,- 1),2 < fc < (N ( - 1)} 

and the boundary mesh points by 

dD h = {(£i,Vj,Qk) : i € {1 ,N*}} U {(&,i7j,Gfc) : 3 € {1,^}} U {(fc.iy.Ck) = * € {1,JV C » 


Also, the discrete approximation of U in (D x Z? r ) is denoted by 

C/ (r, £, 77 , C) = Ul(nAr, (i - l)h ( , ( j - l)h v , (k - 1 )h c ) = U^ k 


(4.17) 


3.43 Spatial differencing 

The spatial derivatives in equation (4.13) are approximated by the appropriate finite-difference 
quotients, based on the values of U T h at mesh points in D h U dD h . We use three-point, second-order 
accurate central difference approximations in each of the three coordinate directions. 

In the computation of the finite-difference approximation to the RHS, the following two general 
forms of spatial derivatives are encountered, i.e.: 

de( m )(U) 

dt 

and , . 

at( m )(u,u^) 

ae 
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(4.18) 


The first form is differenced as (using m-th component vector function E as an example) 

The second form is approximated as (using m-th component of vector function T as an example) 

^ (m) (U,U^) lfn \ f ,Ut-n.7.fc + U M,fc ^ , u »+ i J£ ~ u t \1 


_ + (HiiZ!* — U{ ~ 1 ^ )]} + 0{h\) 

2 he 


(4.19) 


for 2 < i < (7V^ - 1 ). Similar formulas are used in the 77 - and C~ directions as well. 

During the finite-difference approximation of the spatial derivatives in the implicit part, following 
three general forms are encountered: 

a[aW)(u)Au(')] 

0[ m ( m ,O(U,U e )AuW] 


and 


9 2 [n( m -')(U)Au^| 

W 


The first form is approximated by the following: 

* (4.20) 

The second form is differenced in the following compact three -point form: 




(4.21) 


Finally, the third form is differenced as follows: 
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(4.22) 


d 2 [n ( m ’*)(U)AuW] 

ee 



“i+lj.ls 

- [(2// t |){n(’"-'>(U jJit )})Aufj k 

+ [(l/fc|){n<"*' l >(U j _ 1J j t )}]Au^ 1J ., t 


3.4.4 Added higher order dissipation 

When central difference type schemes are applied to the solution of the Euler and Navier-Stokes 
equations, a numerical dissipation model is included, and it plays an important role in determining 
their success. The form of the dissipation model is quite often a blending of second-difference and 
fourth-difference dissipation terms (ref. 4). The second-difference dissipation is nonlinear and is used to 
prevent oscillations in the neighborhood of discontinous (i.e., C°) solutions, and is negligible where the 
solution is smooth. The fourth-difference dissipation term is basically linear and is included to suppress 
high-frequency modes and allow the numerical scheme to converge to a steady state. Near solution 
discontinuities, it is reduced to zero. It is this term that affects the linear stability of the numerical 
scheme. 


In the current implementation of the synthetic problem, a linear fourth- difference dissipation term 
of the following form: 

^ir 1 u4 d*i r , 

' L h 1- h f 


~ At£ [ + 


drf 1 


3C 4 


■] 


is added to the right hand side of equation (3.1a). Here, e is a specified constant. A similar term with 
U” replaced by A IT 1 will appear in the implicit operator, if it is desirable to treat the dissipation term 
implicitly as well. 

In the interior of the computational domain, the fourth-difference term is computed using the 
following five -point approximation: 


,a 4 u n , 


- 4U ?-1 ** + "t*** < 4 - 24a > 

for 4 < i < ( N — 3). 


At the first two interior mesh points belonging to either end of the computational domain, the 
standard five-point-difference stencil used for the fourth-difference dissipation term is replaced by one- 
sided or one-sided biased stencils. These modifications are implemented so as to maintain a non-positive 
definite dissipation matrix for the system of difference equations (ref. 16). 


Then, at i = 2: 


,d 4 U n 


h\^r talk * UT+2J,* - + 51^, 


(4.24b) 
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and at i = 3: 


Also at i = — 2: 


lij.it * iw - 4U ? + 1 jjb + - 4U ?- 




(4.24c) 


and at i = Ne — 1: 


* - 4U !W + 6U u.* - 4U ?-wa + 
h t^rb.* “ - 4U ?-w.* + «?-w 


(4.24 d) 


(4.24e) 


Similar difference formulas are also used in the rj— and £— directions. 

Also, for vector functions T, V and W used here: 

(M - N e ) = 0, (P - Q v ) =0, (R - S C ) = 0 
Then, for the cases where added higher order dissipation terms are treated only explicitly, equation (4.13) 


reduces to 


a a [ a(A) ” + 9 < N) " + a(B) " + ^( Q ) + ^ + ^§-]}AU" = 

{I - At[ -§t ~w sc + a? 

A r d(E + T) n d(F + V) n ( d(G + W) n , 

At1 a? _ + a? ac 1 
r ^u 71 

- AT£ h-d ^~ + h v-^jr + h <~dC r] 

+ ArH* 


(4.25a) 


When the implicit treatment is extended to the added higher order dissipation terms as well, 
equation (4.13) takes the following form: 

,. . , 8(A)” . 8 2 (N)" 4 8 4 (I) 

(I_At[ ~§t + ~w~~ ^ 

, 8(B)” . 8*(Q)” .^(I) 

+ ~a^T + ~W ^8r, 4 

8(C)” , S 2 (S) n ,.48 4 (I)„ AII n_ 

+ * + " £ft <"ac rl}AU - ( 4 - 25i> ) 

. t 8(E + T)” . 8(F + V) n , 8(G + W)", 

Ar l — a? — + — + oc J 


d£ drj 

^4 

+ ArH* 


^U 71 . , 4 d 4 U 71 , ^U 71 , 


+ « 


ac 4 
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The modified vector forcing function is given by: 

8E(U*) 9F(U*) 0G(U*) 

H ({,u,C) = -[- 5 j- + -^-+ a( 

0T(ir,uj) av(u*,u;) 0W(u*,uj) 

+ aj + a? + a< 1 

+ 4^ + ^ + *^], for «,„,<) €DxD, 


(4.26) 


3.4.5 Computation of the explicit part — -RHS 

The discretized form of the explicit part of equation (4.25) is given by: 

[RHS] n | < j, * w Ar[D e (E + T) n + D v ( F + V) n + £> C (G + W ) n ]| iJtk 

- At b [hjDfir + U" + h^U"] |< (4-27) 
+ Ar[H 

where D* , D v and are second-order accurate, central spatial differencing operators, defined in ZV 
At each point in the mesh, [RHS] * is a column vector with 5 components. Discretization of added 
higher order dissipation terms was already discussed in section 4.4. Here we consider the computation 
of the first term in equation (4.27), using formulas given in equations (4.18) and (4.19): 

p*(E + T) n + A,(F + V) n + D C (G + Vf) n }\ij,k = (l/2h ( )[E(U? +1Jffc ) - E(U?_ 1Jtfc )] 

wm ¥TT1 


+ (1/^){T[( 


^T+l ,j,k + ^ t 1 , j,fc ^i,j,k 

2 M 


+ (l/2h,)IF(U? J+lifc ) - F(urj-i, fc )] 

+ (i/M { v[(^y^). ( U ^ +1 ^~ U ^)] 

+ (l/2/t C )[G(Ur j -,j fc+ i)-G(U^- ifc _ 1 )] 

+ (1Ac){ W [( -^ ± T~ ( ^’^1 — ’* )] 


(4.28) 


for {(i, j, fc) € £>/,}• Also, [RHS]”,- k = 0 for i = 1 or A^, j = 1 or N,, and A: = 1 or 7V C . 

This right hand side computation is equivalent in terms of both arithmetic and communication 
complexity to a regular sparse block (5 x 5) matrix-vector multiplication, which is the kernel (c). 
However, its efficient implementation, in this case, does not necessarily require the explicit formation 
and/or storage of the regular sparse matrix. 
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3.4.6 Computation of the forcing vector function 

Given the analytical form of U*, the forcing vector function H* can simply be evaluated analytically 
as a function of £, q, and <, using equation (4.26). This function, along with equation (4.15), can then 
be used to evaluate [H*]^, which is also a colunm vector with 5 components. 

Here, we opt for a numerical evaluation of [H*] iJ)fc , using [U*]ij,jfc and the finite-difference ap- 
proximations of equations (4.18) and (4.19), as in the case of equation (4.28). 


[H*]lij,fc « (l/2^)[E(Ur +1J , fc ) 


+ (1/^){T[(- 


v i+u,k + Yk* ), ( V * i+l 'j' k 


)] 




+ (l/2M[F(U*j + i, t )-F(Uij-l,i)] 
+ 


+ (l/2/i^)[G(U*j^ +1 ) — G(U£j^_j)] 


(4.29) 


ht 


+ e [hpiv* + + fcjajiniij,* 

for {(z, j,k) € D/J. The fourth difference dissipation terms are evaluated using equation (4.24). Also, 
[U]* j k = 0 for z = 1 or N$, j = 1 or Nr, and k = 1 or N c . 

3.4.7 Solution of the system of linear equations 


Replacing the spatial derivatives appearing in equation (4.25) by their equivalent difference ap- 
proximations results in a system of linear equations for [AlF 1 ]^* for i € [2 ,N^ - 1], j £ [2 ,Nr, - 1] 
and k € [2 ,Nr - 1]. Direct solution of this system of linear equations requires a formidable^matnx 
inversion effort, in terms of both the processing time and storage requirements. Therefore, AU n is ob- 
tained through the use of an iterative method. Here we consider three such iterative methods, involving 
the kernels (a), (b) and (d). All methods involve some form of approximation to the implicit operator 
or the LHS of equation (4.25). For pseudo-time marching schemes, it is generally sufficient to perform 
only one iteration per time step. 
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3.4.7.1 Approximate factorization (Beam-Warming) algorithm- In this method, the implicit 
operator in equation (4.25a) is approximately factored in the following manner (refs. 1 and 9): 

{i- Mt? + x V - + AU " “ ' ““ 

5 (4.30) 

The Beam-Warming algorithm is to be implemented as shown below: 

Initialization 

Set the boundary values of for (i,j, k) € dD h in accordance with equation (3.7). 

Set the initial values of U?, t for (i,j, k) € D h in accordance with equation (3.8). 

*wr 

Compute the forcing function vector, ^ for (i,j,k) € Dh, using equation (4.29). 

Step 1: Explicit part 

Compute the [RHS]7 j Jt for ( i,j,k ) € Dh- 
Step 2: f-Sweep of the implicit part 

Form and solve the following system of linear equations for [AUijjj fc for (i,j, k ) € Dh- 

{I - Arp € (A) B + D\{ N) n ]}AUi = RHS 

Step 3: 77 -Sweep of the implicit part 

Form and solve the following system of linear equations for [Al^Jij.jfc f° r (h ji k) € Dh- 

{I - Ar[D,(B)" + I)2(Q) n ]}AU 2 = AOj 

Step 4: £ -Sweep of the implicit part 

Form and solve the following system of linear equations for [AU”],-^ for (i,j,k) € Dh- 

{I - Ar[D c (C) n + I>^(S) n ]}AU n = AU 2 

Step 5: Update the solution 


[U" + 1 ]i J,* = mijjk + for (i, j, k) € D h 

Steps 1-5 consist of one time-stepping iteration of the Approximate Factorization scheme. The 
solution of systems of linear equations in each of the steps 2-4 is equivalent to the solution of multiple, 
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independent systems of block tridiagonal equations, with each block being a (5 x 5) matrix, in the three 
coordinate directions £, rj, C respectively. For example, the system of equations in the £-sweep has the 
following block tridiagonal structure: 

+ [Bij.tllAU.1^,, + [Cjj j i][AUi]i +1 j 1 i = (RHSI..M 2 < i < N ( - 1 
[•Ajv e j,t][AUi]jv { _i j,i + [ B A' t j,)c][Alli]N £ jJc - |RHS] W(J ,t 

where (j € [2, A^-l]) and (k € [2, N ( - 1]). Also, [-4], [B] and |C] are (5 x 5) matrices and [AUi] { j,t 
is a (5 x 1) column vector. 

Here, for 2 < i < (N^ - 1), using equations (4.20) and (4.22): 

[Aijjfc] = -AT{(-l/2fc { )[A(U?_ ljJb )l + (l/^)[N(U"_ 1Jlt )]} 

[B iJ , t ] = l + AT(2/h|)[N(U? J ., l )] < 4 - 32 ) 

Pijji = -Ar{(l/2h f )[A(U? +1J . t )] + (l/(t|)|N(U? +1J , t )]} 

Also, = [I], Pi j,i] = [0],p4jv ( jdt] = [0]. and [Bw £ j,fc] = l 1 )- 

3.4.7 2 Diagonal form of the approximate factorization algorithm- Here the approximate fac- 
torization algorithm of section 4.7.1 is modified so as to transform the coupled systems given by die left 
han d side of equation (4.25b) into an uncoupled diagonal form. This involves further approximations in 
the treatment of the implicit operator. This diagonal ization process targets only the matrices A, B, and 
C in the implicit operator. Effects of other matrices present in the implicit operator are either ignored 
or approximated to conform to the resulting diagonal structure. 

The diagonalization process is based on the observation that matrices A, B, and C each have a set 
of real eigenvalues and a complete set of eigenvectors. Therefore, they can be diagonalized through 
similarity transformations (ref. 13): 


A = T f A 4 T ( -> 

B = T^T^ 1 (4-33) 

C = T C A C T^ 


with 

A £ = Diag [-{uW/uW), -(u (2) /u (1) ), -(u^/u^), -(u^/u^ + a), 

Arj = Diag [-(u (3) /w (1) ), -(tt (3) /u (1) ), -(u {3) /« (1) )> -(u (3) /« (1) + a )< 

A c = Diag [-(u (4) /u (1) ), -(u (4) /u (1) ), -(u (4) /u (1) ), -(u (4) /« (1) + a)> 


-(uW/uW - a.)] 

—(u^/u^ — a)] 

— (i/ 4 tyu^ — a)] 

(4.34) 


where 


a = 


N 


ci 


,«< 5 > „.,M 2) ] 2 + [“ <3) 1 2 + I“ <4) ] 2 „ 
^Si) -0 ' 51 pJf 11 


(4.35) 
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and T^(U), Tj^U), and T^(U) are the matrices whose columns are the eigenvectors of A, B, and C, 
respectively. When all other matrices except A, B, and C are ignored, the implicit operator is given by 


LHS=[/-Ar^][/-AT^]|/-Ar^] 


(4.36) 


Substituting for A, B, and C, using equation (4.33) in equation (4.36), we get 

3(T;A f T f - 1 ) n 

9(T ( A { T^ 1 )" 


LHS =[(T^ 1 ) n - At 


ae 


-] 

dr} 


x I(T,T^)" - At ’ ■■■ ” ] X ((T^ 1 )" - At- 


(4.37) 




JAU" 


A modified form of equation (4.37) is obtained by moving the eigenvector matrices T^, T^, and 
outside the spatial differential operators d/d£, d/dr}, and d/d£, respectively (ref. 13): 


LHS = XJ[I - At^K^HT,)"!! - At^^](T- 1 )"(T« : )'‘[I - At^^)(T^)"AU” 


This can be written as 

LHS = TJ[I - At ^M ]N[I - Ar a ^^- ]P[I - Ar^^-](T^ 1 ) n AU n 

where 

N = T^T^; N" 1 = T" 1 ^ (4.38) 

P = T- 1 T C ; p-l =T -% 

The eigenvector matrices are functions of f , rj and £ and therefore this modification introduces further 
errors of O(Ar) into the factorization process. 


During the factorization process, the presence of the matrices N, Q, and S in the implicit part 
were ignored. This is because, in general, the similarity transformations used for diagonalizing A do 
not simultaneously diagonalize N. The same is true for the tj and £ factors as well. This necessitates 
some ad-hoc approximate treatment of the ignored terms, which at the same time preserves the diagonal 
structure of the implicit operators. Whatever the approach used, additional approximation errors are 
introduced in the treatment of the implicit operators. We chose to approximate N, Q, and S by diagonal 
matrices in the implicit operators, whose values are given by the spectral radii p(N), p(Q), and p( S), 
respectively (ref. 17). In addition, we also treat the added fourth-difference dissipation terms implicitly. 


LHS = T£[I - Ar{ 


<9(Ag) n | a 2 [p(N) n I] 




d £ 2 


-«d2gi>] 




x S 1 I -at { ^ + ^!1 




}] 


x P[I - Ar{ 


£(A C ) n ^ d 2 [p(S)"I] 

3C 


+ 


n dr} 4 




(4.39) 
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The matrices T^ 1 , T c , N _1 , and P 1 are given by: 

(1 - [q/a 2 ]) c 2 [u( 2 Vu( 1 )]/a 2 

-(iiW/fttW] 2 ) o 

(u< 3 V[u (1 >] 2 ) 0 

r (q — a[«^/ w ^^]) a ( a ~ c 2M 2 V n ^l) 

\a(q + a[u( 2 V«^^j) _<7 ( a + c 2 [u( 2 V u ^]) 


v= 


c 2 [t/ 3 > /u^ 1 ^ ] /a 2 c 2 [u( 4 Vu (1) ]/a 2 

0 (l/^ 1 )) 

— (l/t/ 1 )) 0 

-oc2\u^ /vW) -ac2[u^ /u^) 
-<TC2[t/ 3 V u ^l —crc2[u^ /u^\ 


-[c 2 /a 2 ]\ 

0 

0 

<JC2 

OC2 } 


0 

0 

1 

0 

-u(l) 

[u(2)/u(l)] 

ud) 

0 

[u(3)/u(l)J 

0 

0 

[«(4)/«(l)] 

U 3) 

-u( 2 ) 

[Q/c 2 ] 


a[tx(2)/u(l)] 
a[u(3)/ti(l)j 
a([tt(4)/u(l)] + «) 
a[(g + a 2 )/c 2 + o(^ 4 V«^*)] 


a ^ 

q[u(2)/u(1)] 
q[u(3)/u( 1)] 
a([u(4)/u(l)] - a) 
a[(g + a 2 )/c 2 - a(t/( 4 V^^)] / 



(0 

1 

0 

0 

Vo 


-1 0 0 0 \ 

0 0 0 0 

0 0 l/\/2 -1/V2 

0 - l / v /2 1/2 1/2 

0 1/V2 1/2 1/2 / 



where a = l/(-\/2u^a) and a = 


0 0 0 l/v/2 -l/v/2 \ 

0 0-10 0 

0 10 0 0 

-l/v /2 0 0 1/2 1/2 

l / y /2 0 0 1/2 1/2 / 


In addition, the spectral radii of the matrices N, Q, and S are given by: 

p(N) = max(d ^ 

df ] + [4./3.]* 3 * 4 [l-/u (1) ] 
df ] + * 3 * 4 [l./u (1) ] 

+ ^3^:4 [l./te^^] 
df + *1 *3*4*5 [W« (1) ] ) 
p(Q) = max( 4^ 

4 2) + *3*4[l./ w ^^] 

4 3) + [4./3.]fc 3 * 4 [l./« (1) ] 

4 4) + *3* 4 [1./^ 1 ^] 

4 5) + *l*3*4*5[l-/ u ^] ) 


55 



p(S) = max{ d ^ 

df ] + k 3 k 4 {l./uM] 
df } + k 3 k 4 [l./uW] 
df + [4./3.]fc3^4[l./w {1) ] 

d ^ 4 - kik3k 4 ks[l./u ^] ) 

The explicit part of the diagonal algorithm is exactly the same as in equation (4.26b) and all the 
approximations are restricted to the implicit operator. Therefore, if the diagonal algorithm converges, 
the steady-state solution will be identical to the one obtained without diagonalization. However, the 
convergence behavior of the diagonalized algorithm would be different. 

The Diagonalized Approximate Factorization algorithm is to be implemented in the following order: 
Initialization 

Set the boundary values of Uj for (i,j,k) G dD^ in accordance with equation (3.7). 

Set the initial values of U? for (i,j, k ) € D h in accordance withequation (3.8). 

Compute the forcing function vector, H*^ for (i,j, k) G D^, using equation (4.29). 

Step 1: Explicit part 

Compute [RHSjJj^ for ( i,j,k ) € D^. 

Step 2: 

Perform the matrix-vector multiplication: 

IAUj] = (T^riRHS] 

Step 3: £-Sweep of the Implicit part 

Form and solve the following system of linear equations for AU 2 : 

{I - Ar[L>^(A^) n ] - Ar[L>|(p(N) n I)] + At[£/i|d|(I)]}[AU 2 ] = [AUi] 

Step 4: 

Perform the matrix-vector multiplication: 

[AU 3 ] = N _1 [AU 2 ] 
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Step 5 : 77-Sweep of the implicit part 

Form and solve the following system of linear equations for AU4: 

{/ - AT[Drf(A v ) n ] - Ar[D*(p(Q) n I)] + Ar[e/ 4 ^(I)]}[AU 4 ] = [AU 3 ] 


Step 6: 

Perform the matrix-vector multiplication: 

[AU 5 ]=P- 1 [AU4] 


Step 7 : C-Sweep of the implicit part 

Form and solve the following system of linear equations for AUg: 

{I - Ar[£> c (A c ) n ] - Ar[D^(p(S) n I)] + Ar[e^r>J(I)]}[AU 6 ] = [AU 5 ] 


Step 8: 

Perform the matrix-vector multiplication: 

[Air*] = t c (au 6 ] 


Step 9 : Update the solution 


U n+1 = U 11 + AU” 

Steps 1-9 constitute of one iteration of the Diagonal Form of the Approximate Factorization algorithm. 
The new implicit operators are block pentadiagonal. However, the blocks are diagonal in form, so 
that the operators simplify into five independent scalar pentadiagonal systems. Therefore, each of the 
steps 3, 5, and 7 involve the solution of multiple, independent systems of scalar pentadiagonal equations 
in the three coordinate directions ^,77 and C respectively. For example, each of the five independent 
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scalar pentadiagonal systems in the f -sweep has the following structure: 

ci^[AU 2 ]S^ 4- d^jAUtf +.U, fclAiygJb = [AUilgji 
baj, fclAUalgi + c 2J , fc [AU 2 ]g> + d 2J)fc [AU 2 ]g, + e 2j(fc [AU 2 ]g fc = [AU!]g, 

+ e i j,fc [AU 2 ]|^2 j,jt = l A "l]&i. for 3 * i <(*€ - 2) 
®!V c -lJ 1 ik[ AU 2]A^ ) -3j,Jfc + b JV r t-lj f *[ AU 2]i^L 2 J f Jk + c JV € -lj,*[^U2]j^Lij tfc 

+ d N^-ljM AV ^N^J,k = ( AU l]^-lJ,ik 

*^,j,fc[AU2]^ ) _ 2jifc + b ^j,i[AU 2 ]^ ) _ij,fc + ew € j\fc[AU 2 ]^ j fJb = [AUilj^^ 

‘ (4.40) 

where, (j € [2,/^ - 1]), (k e [2 ,N^ - 1]) and (m € [1,5]). The elements of the pentadiagonal matrix 
are given by the following: 


C 1 J,k = !•; d l J,k = °M e lj\* = 0 

» 2 j,i = Ar(l/2h f )[(A e )"]^y ) - Ar(l/^)WN)“] WJt 
C 3J,t = 1. + Ar(2/Af)[p(N)") 2j , t + Are (5) 

d 2J , t = -Ar(l/2h ? )!(A ? )"]<”'™ > - Ar(l/A|)[p(N)"] 3j - t + Are (-4) 
®2 J,k = At£ 


(4.41a) 


(4.416) 


“3 ,j,k = 0.0, 

KSjVfc = Ar(l/2A e )[(A 4 )"]<™'” ) - Ar(l/fc|) [p(N)"] 2J , t + Ar £ (-4) 

c 3Jtk = 1. + Ar(2/fcf)[p(N) n I 3 j, t + Ar £ (6) (4.41c) 

*3j,k = -Ar(l/2A ? )[(A ? )’‘)<™'” ) - Ar(l/hf)(p(N)"] 4>M + Arc (-4) 

•3j,i = Are 


*ij,k = A T£ 

b(j,t = Ar(l/2A { )[(A { )"]S"|"J t - Ar(l/ftf)(p(N)"]i-i j,* + Ar £ (-4) 

cij,t = 1. + Ar(2/A|) WNHyjk + Ar e (6) <4.41d) 

dy, t = -Ar(l/2A ? )[(A { )"];™;™> fc - Ar(l/A|)WN)"] j+w ,fc + Art (-4) 

= AT£ 


y 
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& N^—2J,k = Are 

b W{ -2j,i = Ar(l/2fc ? )[(Aj)"]^f 3 ) J . t - AT(l/^)WN) n U s - 3 j,t + Are (-4) 
c^_ 2 j,fc = 1. + Ar(2/h|)[p(N) n ]^_2j,fc + Are (6) (4.41e) 

dW { - 2 j,fc = -Ar(l/2A { )[(A ? )"]^f 1 ) i> , t - Ar(l/*|)WN)"]iv e _ Wlt + Are (-4) 
e N i -2J,k = 0-0- 


*Nz-lj,k = Are 

b W{ -lj,k = AT(l/2h 4 )[(A ( rjj£^ - Ar(l/h|)WN)"]AT ( _ 2 j,* + Are (-4) 
c^-i j,k = 1- + Ar(2/h|)[p(N) n ] iV? -ij,* + Are (5) 
d N ( -lJ,k = -Ar(l/2^)[(A^) n ]^’ J J - Ar(l//i|)[p(N) n ]^ J) * ; 


(4.41/) 


& N(j,k ~ 0 -; h N^J,k ~ 0*5 c N^J,k 1 ( 4 - 415 ) 

3.4.73 Symmetric successive over-relaxation algorithm— In this method, the system of linear 
equations obtained after replacing the spatial derivatives appearing in the implicit operator in equa- 
tion (4.25a) by second-order accurate, central finite difference operators, is solved using the symmetric, 
successive over-relaxation scheme. Let the linear system be given by 

[K^AlT 1 ] = [RHS n ] (4.42) 


where 

K n = {I - Ar[D^A n + D\ N n + D v B n + D\ Q n + D c C n + £>^S n ]} Alf 1 , for (i,j, k) € D h 

It is specified that the unknowns be ordered corresponding to the gridpoints, lexicographically, such 
that the index in f -direction runs fastest, followed by the index in ^-direction and finally in C-direction. 
The finite-difference discretization matrix K, resulting from such an ordering has a very regular, banded 
structure. There are altogether seven block diagonals, each with a (5 x 5) block size. 

The matrix K can be written as the sum of the matrices D, Y and Z: 

K n = D n + Y n + Z n (4.43) 


where 


D n = Main block — diagonal of K n 

Y n = Three sub — block — diagonals of K n 

Z n = Three super — block — diagonals of K n 
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Therefore, D is a block-diagonal matrix, while Y and Z are strictly lower and upper triangular, respec- 
tively. Then the point-SSOR iteration scheme can be written as (refs. 14 and 15): 

[X^AU 71 ] = [RHS] (4.44) 


where 


X" = w(2. - u>)(D n + wY n )(D n ) _1 (D n + uZ n ) 
= u>( 2. - u/)(D n + wY n )(I + a;(D n )" 1 Z n ) 


(4.45) 


and u) € (0.,2.) is the over-relaxation factor (a specified constant). The SSOR algorithm is to be 
implemented in following manner: 

Initialization 

Set the boundary values of in accordance with equation (3.7). 

Set the initial values of U? l in accordance with equation (3.8). 

Compute the forcing function vector, H*^. for (i,j, k) € D^, using equation (4.29). 

Step 1: The explicit part 

Compute [RHS]”^ for (i,j, k) 6 D^. 

Step 2: Lower triangular system solution 

Form and solve the following regular, sparse, block lower triangular system to get [AUi]: 

(D n + u;Y n )[AUi] = [RHS] 


Step 3: Upper triangular system solution 

Form and solve the following regular, sparse, block upper triangular system to get [AU n ]: 

(I + w(D n ) -1 Z n )[AU n ] = [AUi] 


Step 4: 

Update the solution: 

U n+1 = IT 1 + [l/w(2. - o>)]AU n 


Steps 1-4 constitute one complete iteration of the SSOR scheme. The l-th block row of the matrix K 
has the following structure: 


[>l/][AU n ] / _(^_2)(^ r; _2) + [^i][^U n ]/_(^_2) + [CiHAU”]/.! + [P,][AU»], 
+ [£/][AU n ]/ +1 + [^([][AU n ] /+ (^_2) + [^z][AU n ]/ + (^_2)(iv fJ _2) = [RHS]/ 
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where ( = i + (N ( - 2)0 - 1) + (W{ - 2)(JV„ -2 )(k - 1) and (i, j, k) <= D h . The (5 x 5) matrices 
are given by: 

[A,] = - ArO/^HSlU?^.!)) 

[B ( ] = -Ar(-l/2h,)(B(U7 J _ 1-t )] - Ar(l/^)(Q(U?j_i, t )) 

[C,) = — Ar(— l/2h^)[A(Uf_ij t jt)] - Ar(l/h|)[N(l)?_i j, t )] 

[ V,\ = I + Ar(2/h|)[N(U? J , t )) + Ar(2/fc2)[Q(Uf J j)) + At(2/^)[S(U? Ji t )] (4.47) 

|f|] = -Ar(l/2h ? )[A(U? +u , *)] - Ar(l/h|)[N(U? +Wtit )] 

Fl) = -Ar(l/2A,)[B(U? J+U )] - AT(l/h5)[Q(U? J+1 , t )] 

[ft] = -Ar(l/2h < )[C(U? | j, t+1 )] - Ar(l/^)[S(U? Jil+1 )] 

Also, A, B and C are the elements in the Z-th block row of the matrix Y, whereas £>T and Q are the 
elements in the Z-the block row of the matrix Z. 

3.5 Benchmarks 

There are three benchmarks associated with the three numerical schemes described in section 4, 
for the solution of system of linear equations given by equation (4.25). Each benchmark consists of 
r unnin g one of the three numerical schemes for N s time steps with given values for N^, N^, and 
At. 


3.5.1 Verification test 

For each of the benchmarks, at the completion of N s time steps, compute the following: 

1. Root Mean Square norms RMSR(m), of the residual vectors [RHS( m )]"j^, for m = 1, 2, . . . , 5 
where n = N s , and (i, j, k ) € D ^ 


RMSR{m ) 


'i (N ( - 2 )(W„ - 2 )(N ( - 2) 


(5.1) 


/ \ ^ \ 

2. Root Mean Square norms RMSE(m) of the error vectors {[U*]>™[ “ for m = 

1, 2, . . . , 5 where n — N s , and ( i , j , k) e D ^ 


RMSE(m) = 


\ 


^k = 2 t->j = 2 ^t=2 

{N~Z - 2)(W„ - 2)(N C - 2) 


(5.2) 
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3. The numerically evaluated surface integral given by 


J2-1 *2-1 

I = 0.25{ ^2 ^ + <Pi+i j t k\ + ^tj+l,*i + V’i+l ,j+l,fci 

3=31 *=*l 

+ J,fc 2 ^*+1 J>*2 V 7 * J+l.*2 ^*+1 J+l,^] 

/c2 — 1 t2~ 1 

+ 21 21 + ¥>i+lji,fc + V’tJi.fc+l + V>i+lji^+l /c 

k=ki t=*i 

+ V>tJ2,* + W+1J2,* + ^tj 2 .*+l + ^*+ 1 J2.*+ll 
*2~1 >2-1 

+ 21 S J,k + Y’iu'+U + V^ij J,*+l + ^*i J+U+l 

*=*1 3=31 

+ ^12 J,k + ^*2J+1»* ^*2 Ji*+1 V’^J+ljfc+l]} 

where i\,i 2 ,ju J2> and *2 are specified constants, such that 1 < i\ < i 2 < A^, 1 < j\ < 32 < N^ 
and 1 < < *2 < A^, and 


r f 51 n r /[ w(1) ] 2 + l « (2) ] 2 + [“ (3) ] 2 )^ 

<p = c 2 {v> > - 0.5( )} 


The validity of each these quantities is measured according to 

\Xc -Xrl 
\Xrl ~ 


(5.4) 


where X c and X r are the computed and reference values, respectively, and e is the maximum allowable 
relative error. The value of e is specified to be 10 -8 . The values of X T are dependent on the numerical 
scheme under consideration and the values specified for N^, A/^, N^, At and N s . 

3.5.2 Benchmark 1 

Class A: Perform N s = 200 iterations of the Approximate Factorization Algorithm, with the 
following parameter values: 


= 64; Nrj = 64; = 64 


and 


At = 0.0008 

Timing for this benchmark should begin just before the Step 1 of the first iteration is started and end 
just after the Step 5 of the A^-th iteration is complete. 


Class B: Same except A^ = = 102 and Ar = 0.0003. 



3.5 3 Benchmark 2 


Qass A: Perform N s = 400 iterations of the Diagonal Form of the Approximate Factorization 
Algorithm, with the following parameter values: 

N ( = 64; N v = 64; N Q = 64 


and 

At = 0.0015 

Timing for this benchmark should begin just before the Step 1 of the first iteration is started and end 
just after the Step 9 of the 7V s -th iteration is complete. 

Qass B: Same except = Nr, = = 102 and At = 0.001. 

3.5.4 Benchmark 3 

Qass A: Perform N s = 250 iterations of the Symmetric Successive Over-Relaxation Algorithm 
with the following parameter values: 

= 64; Nr, = 64; N c = 64 


and 


At = 2.0 u = 1.2 

Timing for this benchmark should begin just before the Step 1 
just after the Step 4 of the AT s -th iteration is complete. 


of the first iteration is started and end 


Qass B: Same except = Nr, = = 102. 


For all benchmarks, values of the remaining constants are specified as 


k\ = 1.40; 

k 2 = 0.40; 

k 3 = 0.10; 

k 4 = 1.00; 

k 5 = 1.40 

c u = 2.0 

C 2 ,i = 1.0 

C 3 ,i = 2.0 

C 4 ,i = 2.0 

C 54 = 5.0 

o 

o 

II 

<N 

o 

C 2 , 2 = O.o 

C 3 , 2 = 2.0 

C 4 , 2 = 2.0 

C 5, 2 = 4.0 

£ 

CO 

II 

p 

O 

C 2 ,3 = 0.0 

c 3 , 3 = 0.0 

c 4 , 3 = 0.0 

C 5 , 3 = 3.0 

iu 

II 

O 

c 2 , 4 = 0.0 

C 3 ,4 = o.o 

C 4i4 = 0.0 

C 5 , 4 = 2.0 

Cl, 5 = 5.0 

C 2 ,5 = 1.0 

c 3 ,5 = o.o 

C 4 ,5 = O.o 

C 5 , 5 = 0.1 

Ci, 6 = 3.0 

C 2 ,6 = 2.0 

C 3 ,6 = 2.0 

C 4 ,6 = 2.0 

C5,6 = 0.4 

Ci j = 0.5 

<?2,7 = 3.0 

C 3 ,7 = 3.0 

C 4 ,7 = 3.0 

C5,7 = 0.3 

Cl, 8 = 0.02 

C 2 ,8 = 0.01 

C 3 ,8 = 0.04 

C 4 , 8 = 0.03 

C 5 ,8 = 0.05 

Ci, 9 = 0.01 

C 2 ,9 = 0.03 

C 3 ,9 = 0.03 

C 4 , 9 = 0.05 

C 5 ,9 = 0.04 

Cl, 10 = 0.03 

C 2 ,io = 0.02 

C 3 ,io = 0.05 

C 4 ,io = 0.04 

C 540 = 0.03 
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Cl, li 

= 0.5 

C 241 : 

= 0.4 

C 3.11 

= 0.3 

C 4.11 = 0.2 

C 5.11 

= 0.1 

Cl, 12 

= 0.4 

C 242 : 

= 0.3 

C3,12 

= 0.5 

C 442 = 01 

C 542 

= 0.3 

Cl, 13 

= 0.3 

C2,i3 : 

= 0.5 

C 343 

= 0.4 

C 443 = 0.3 

C 543 

= 0.2 




= df ] -- 

= 4 3) = 

II 

df ] = 0.75 





4 1 ’ 

= 4 2) = 

=4 3, = 


d?> = 0.75 






Cm\ 

*a J ‘ 

II 

II 

-s 

II 

II 

V 

<f[ 5) = 1.00 




e = [max^^jd^d^l^.O 
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